Invariance of visual operations at the level of receptive fields



Tony Lindeberg 

Department of Computational Biology, 
School of Computer Science and Communication, 
KTH Royal Institute of Technology, 
Stockholm, Sweden 



Abstract 

The brain is able to maintain a stable perception although the visual stimuli vary substantially 
on the retina due to geometric transformations and lighting variations in the environment. This 
paper presents a theory for achieving basic invariance properties already at the level of receptive 
fields. 

Specifically, the presented framework comprises (i) local scaling transformations caused by 
objects of different size and at different distances to the observer, (ii) locally linearized image 
deformations caused by variations in the viewing direction in relation to the object, (iii) locally 
linearized relative motions between the object and the observer and (iv) local multiplicative 
intensity transformations caused by illumination variations. 

The receptive field model can be derived by necessity from symmetry properties of the 
environment and leads to predictions about receptive field profiles in good agreement with 
receptive field profiles measured by cell recordings in mammalian vision. Indeed, the receptive
field profiles in the retina, LGN and V1 are close to ideal with respect to what is motivated by
the idealized requirements.

By complementing receptive field measurements with selection mechanisms over the parameters
in the receptive field families, it is shown how true invariance of receptive field responses
can be obtained under scaling transformations, affine transformations and Galilean
transformations. Thereby, the framework provides a mathematically well-founded and biologically
plausible model for how basic invariance properties can be achieved already at the level
of receptive fields and support invariant recognition of objects and events under variations in
viewpoint, retinal size, object motion and illumination.

The theory can explain the different shapes of receptive field profiles found in biological 
vision, which are tuned to different sizes and orientations in the image domain as well as to 
different image velocities in space-time, from a requirement that the visual system should be 
invariant to the natural types of image transformations that occur in its environment. 

Author summary



Receptive field profiles registered by cell recordings have shown that mammalian vision has developed
receptive fields tuned to different sizes and orientations in the image domain as well as to
different image velocities in space-time. This article presents a theoretical model by which families
of idealized receptive field profiles can be derived mathematically from a small set of basic assumptions
that correspond to structural properties of the environment. The article also presents a theory
for how basic invariance properties to variations in scale, viewing direction and relative motion can 
be obtained from the output of such receptive fields, using complementary selection mechanisms 
that operate over the output of families of receptive fields tuned to different parameters. Thereby, 
the theory shows how basic invariance properties of a visual system can be obtained already at the 
level of receptive fields, and we can explain the different shapes of receptive field profiles found in 
biological vision from a requirement that the visual system should be invariant to the natural types 
of image transformations that occur in its environment. 

1 Introduction



We maintain a stable perception of our environment although the brightness patterns reaching our 
eyes undergo substantial changes. This shows that our visual system possesses invariance properties 
with respect to several types of image transformations: 

If you approach an object, it will change its size on the retina. Nevertheless, the perception 
remains the same, which reflects a scale invariance. It is well-known that humans and other animals 
have functionally important invariance properties with respect to variations in scale. For example, 
Biederman and Cooper (1992) demonstrated that reaction times for recognition of line drawings
were independent of whether the primed object was presented at the same or a different size as
when originally viewed. Logothetis et al. (1995) found that there are cells in the inferior temporal
cortex (IT) of monkeys for which the magnitude of the cell's response is the same whether the
stimulus subtended 1° or 6° of visual angle. Ito et al. (1995) found that about 20 percent of anterior
IT cells responded to ranges of size variations greater than 4 octaves, whereas about 40 percent
responded to size ranges less than 2 octaves. Furmanski and Engel (2000) found that learning with
application to object recognition transfers across changes in image size. The neural mechanisms
underlying object recognition are rapid and lead to scale-invariant properties as soon as 100-300
ms after stimulus onset (Hung et al., 2005).

In a similar manner, if you rotate an object in front of you, the projected brightness pattern will 
be deformed on the retina, typically by different amounts in different directions. To first order of 
approximation, such image deformations can be modelled by local affine transformations, which
include the effects of in-plane rotations and perspective foreshortening. For example, Logothetis
et al. (1995) and Booth and Rolls (1998) have shown that in the monkey IT cortex there are both
neurons that respond selectively to particular views of familiar objects as well as populations of
single neurons that have view-invariant representations over different views of familiar objects.
Edelman and Bülthoff (1992) have on the other hand shown that the time for recognizing unfamiliar
objects from novel views increases with the 3-D rotation angle between the presented and previously 
seen views. Still, subjects are able to recognize unfamiliar objects from novel views, provided that 
the 3-D rotation is moderate. 

If an object moves in front of you, it may in addition to a translation also lead to a time-dependent
motion field in the brightness pattern on the retina. You may or you may not fixate
on the object. Depending on the relative motion between the object and the observer, this motion
field can be modelled by local Galilean transformations. Regarding biological counterparts of such
relative motions, Rodman and Albright (1989) and Lagae et al. (1993) have shown that in area
MT of monkeys there are neurons with high selectivity to the speed and direction of visual motion
over large ranges of image velocities. Petersen et al. (1985) have shown that there are neurons in
area MT that adapt their response properties to the direction and velocity of motion. Smeets and
Brenner (1994) have shown that reaction times for motion perception can be different for absolute
and relative motion and that reaction times may specifically depend on the relative motion between
the object and the background. When Einstein derived his relativity theory, he used as a basic
assumption the requirement that the equations should be invariant under Galilean transformations
(Einstein, 1920).

The measured luminosity of surface patterns in the world may in turn vary over several orders 
of magnitude. Nevertheless we are able to preserve the identity of an object as we move it out of 
or into a shade, which reflects important invariance properties under intensity transformations. The 
Weber-Fechner law states that the ratio of an increment threshold $\Delta I$ in image luminosity for a
just noticeable difference in relation to the background intensity $I$ is constant over large ranges of
luminosity variations (Palmer, 1999, pages 671-672). The pupil of the eye and the sensitivity of
the photoreceptors are continuously adapting to ambient illumination (Hurley, 2002).



To be able to function robustly in a complex natural world, the visual system must be able to 
deal with these image transformations in an efficient and appropriate manner to maintain a stable 
perception as the brightness pattern changes on the retina. One specific approach is by computing
invariant features whose values or representations remain unchanged under basic image transformations.
A weaker but nevertheless highly useful approach is by computing visual representations
that possess suitable covariance properties, which means that the representations are transformed 
in a well-behaved and well-understood manner under corresponding image transformations. A 
covariant image representation can then in turn constitute the basis for computing truly invariant 
image representations, and thus enable invariant visual recognition processes at the systems level, 
in analogy with corresponding invariance principles as postulated for biological vision systems by 
different authors (Rolls, 1994; DiCarlo and Maunsell, 2000; Grimes and Rao, 2005; Quiroga et al.,
2005; DiCarlo and Cox, 2007; Goris and de Beeck, 2009).



The subject of this paper is to introduce a computational framework for modelling receptive
fields at the earliest stages in the visual system corresponding to the retina, LGN and V1 and to
show how this framework allows for basic invariance or covariance properties of visual operations
with respect to all the above mentioned phenomena. This framework can be derived from symmetry
properties of the natural environment (Lindeberg, 2011, 2012) and leads to predictions of receptive
field profiles in good agreement with receptive field measurements reported in the literature (Hubel
and Wiesel, 1959, 1962; DeAngelis et al., 1995; DeAngelis and Anzai, 2004; Hubel and Wiesel, 2005).
Specifically, explicit phenomenological models will be given of LGN neurons and simple cells in
V1 and be compared to related models in terms of Gabor functions (Marcelja, 1980; Jones and
Palmer, 1987a, b), differences of Gaussians (Rodieck, 1965) or Gaussian derivatives (Koenderink
and van Doorn, 1987; Young, 1987; Young et al., 2001; Young and Lesperance, 2001). Notably,
the evolution properties of the receptive field profiles in this model can be described by diffusion
equations and are therefore suitable for implementation on a biological architecture, since the
computations can be expressed in terms of communications between neighbouring computational units,
where either a single computational unit or a group of computational units may be interpreted as
corresponding to a neuron. Specifically, computational models involving diffusion equations arise
in mean field theory for approximating the computations that are performed by populations of
neurons (Omurtag et al., 2000; Mattia and Guidice, 2002; Faugeras et al., 2009).

Combined with complementary selection mechanisms over receptive fields at different scales
(Lindeberg, 1998), receptive fields adapted to different affine image deformations (Lindeberg and
Gårding, 1997) and different Galilean motions (Lindeberg et al., 2004; Lindeberg, 2011), it will
also be shown how true invariance of receptive field responses can be obtained with respect to local
scaling transformations, affine transformations and Galilean transformations. These selection
mechanisms are based on either (i) the computation of local extrema over the parameters of the
receptive fields or (ii) the comparison of local receptive field responses to
affine invariant or Galilean fixed-point requirements (to be described later). On a neural architecture,
these geometric invariance properties are therefore compatible with a routing mechanism
(Olshausen et al., 1993) that operates on the output from families of receptive fields that are tuned to
different scales, spatial orientations and image velocities. In this respect, the resulting approach will
bear similarity to the approach by Riesenhuber and Poggio (1999), where receptive field responses
at different scales are routed forward by a soft winner-take-all mechanism, with the theoretical
additions that the invariance properties over scale can here be formally proven and the presented
framework specifically states how the receptive fields should be normalized over scale. Furthermore,
our approach extends to true and provable invariance properties under more general affine
and Galilean transformations.

A direct consequence of these invariance properties established for receptive field responses
is that they can be propagated to invariance properties of visual operations at higher levels, and
thus enable invariant recognition of visual objects and events under variations in viewing direction,
retinal size, object motion and illumination. In this way, the presented framework provides a
computational theory for how basic invariance properties of a visual system can be achieved already at the
level of receptive fields. Another consequence is that the presented framework could be used for
explaining the families of receptive field profiles tuned to different orientations and image velocities
in space and space-time that have been observed in biological vision from a requirement that the
corresponding receptive field responses should be invariant or covariant under corresponding image
transformations. A main purpose of this article is to provide a synthesis where such structural
components are combined into a coherent framework for achieving basic invariance properties of a
visual system and relating these results, which have been derived mathematically, to corresponding
functional properties of neurons in a biological vision system.

Another major aim of this article is to try to bridge the gap between computer vision and biological
vision, by demonstrating how concepts originally developed for purposes in computer vision
can be related to corresponding notions in computational neuroscience and biological vision. In
particular, we will argue for explicit incorporation of basic image transformations into computational
neuroscience models of vision. If such image transformations are not appropriately modelled and if
the model is then exposed to test data that contain image variations outside the domain of variabilities
that are spanned by the training data, then an artificial neuron model may have severe problems
with robustness. If on the other hand the covariance properties corresponding to the natural variabilities
in the world underlying the formation of natural image statistics are explicitly modelled and if
corresponding invariance properties are built into the computational neuroscience model and also
used in the learning stage, we argue that it should be possible to increase the robustness of a
neuro-inspired artificial vision system to natural image variations. Specifically, we will present explicit
computational mechanisms for obtaining true scale invariance, affine invariance, Galilean invariance
and illumination invariance for image measurements in terms of local receptive field responses.

Interestingly, the proposed framework for receptive fields can be derived by necessity from a 
mathematical analysis based on symmetry requirements with respect to the above mentioned image 
transformations in combination with a few additional requirements concerning the internal structure 
and computations in the first stages of a vision system that will be described in more detail below. 
In these respects, the framework can be regarded as both (i) a canonical mathematical model for 
the first stages of processing in an idealized vision system and as (ii) a plausible computational 
model for biological vision. Specifically, compared to previous approaches of learning receptive
field properties and visual models from the statistics of natural image data (Field, 1987; van der
Schaaf and van Hateren, 1996; Olshausen and Field, 1996; Rao and Ballard, 1998; Simoncelli and
Olshausen, 2001; Geisler, 2008) the proposed theoretical model makes it possible to determine
spatial and spatio-temporal receptive fields from first principles that reflect symmetry properties of 
the environment and thus without need for any explicit training stage or gathering of representative 
image data. In relation to such learning based models, the proposed normative approach can be 
seen as describing the solutions that an ideal learning based system may converge to, if exposed to 
a sufficiently large and representative set of natural image data. The framework for achieving true 
invariance properties of receptive field responses is also theoretically strong in the sense that the 
invariance properties can be formally proven given the idealized model of receptive fields. 

In their survey of our knowledge of the early visual system, Carandini et al. (2005) emphasize
the need for functional models to establish a link between neural biology and perception. More
recently, Einhäuser and König (2010) argue for the need for normative approaches in vision. This
paper can be seen as developing the consequences of such ways of reasoning by showing how basic 
invariance properties of visual processes at the systems level can be obtained already at the level of 
receptive fields, using a normative approach. 



2 Model for early visual pathway in an idealized vision system 

In the following we will state a number of basic requirements concerning the earliest levels of
processing in an idealized vision system, which will be used for deriving idealized functional models of
receptive fields. Let us stress that the aim is not to model specific properties of human vision or any
other species. Instead the goal is to describe basic characteristics of the image formation process
and the computations that are performed after the registration of image luminosity on the retina.
These assumptions will then be used for narrowing down the class of possible image operations
that are compatible with structural requirements, which reflect symmetry properties of the
environment. Thereafter, it will be shown how this approach applies to modelling of biological receptive
fields and how the resulting receptive fields can be regarded as biologically plausible.

For simplicity, we will assume that the image measurements are performed on a planar retina 
under perspective projection. With appropriate modifications, a corresponding treatment can be 
performed with a spherical camera geometry. 

Let us therefore assume that the vision system receives image data that are either defined on a
(i) purely spatial domain $f(x)$ or a (ii) spatio-temporal domain $f(x, t)$ with $x = (x_1, x_2)^T$. Let us
regard the purpose of the earliest levels of visual representations as computing a family of internal
representations $L$ from $f$, whose output can be used as input to different types of visual modules.
In biological terms, this would correspond to a similar type of sharing as V1 produces output for
several downstream areas such as V2, V4 and V5/MT.

An important requirement on these early levels of processing is that we would like them to be 
uncommitted operations without being too specifically adapted to a particular task that would limit 
the applicability for other visual tasks. We would also desire a uniform structure on the first stages 
of visual computations. 

2.1 Spatial (time-independent) image data 

Concerning terminology, we will use the convention that a receptive field refers to a region $\Omega$ in
visual space over which some computations are being performed. These computations will be
represented by an operator $\mathcal{T}$, whose support region is $\Omega$. Generally, the notion of a receptive field
will be used to refer to both the operator $\mathcal{T}$ and its support region $\Omega$. In some cases when referring
specifically to the support region only, we will refer to it as the support region of the receptive
field.

Given a purely spatial image $f \colon \mathbb{R}^2 \to \mathbb{R}$, let us consider the problem of defining a family of
internal representations

$$ L(\cdot; s) = \mathcal{T}_s f \tag{1} $$

for some family of operators $\mathcal{T}_s$ that are indexed by some parameter $s$, where $s = (s_1, \dots, s_N)$
may be a multi-dimensional parameter with $N$ dimensions. (The dot "$\cdot$" at the position of the first
argument $x$ of $L$ means that $L(\cdot; s)$, when given a fixed value of the parameter $s$, should only be
regarded as a function over $x$.) In the following we shall state a number of structural requirements
on a visual front-end as motivated by the types of computations that are to be performed at the
earliest levels of processing in combination with symmetry properties of the surrounding world.

Linearity. Initially, it is natural to require $\mathcal{T}_s$ to be a linear operator, such that

$$ \mathcal{T}_s (a_1 f_1 + a_2 f_2) = a_1 \mathcal{T}_s f_1 + a_2 \mathcal{T}_s f_2 \tag{2} $$

holds for all functions $f_1, f_2 \colon \mathbb{R}^2 \to \mathbb{R}$ and all scalar constants $a_1, a_2 \in \mathbb{R}$. An underlying
motivation to this linearity requirement is that the earliest levels of visual processing should make as few
irreversible decisions as possible.

Linearity also implies that a number of special properties of receptive fields (to be described 
below) will transfer to spatial and spatio-temporal derivatives of these and do therefore imply that 
different types of image structures will be treated in a similar manner irrespective of what types of 
linear filters they are captured by. 

Translational invariance. Let us also require $\mathcal{T}_s$ to be a shift-invariant operator in the sense that
it commutes with the shift operator $\mathcal{S}_{\Delta x}$ defined by $(\mathcal{S}_{\Delta x} f)(x) = f(x - \Delta x)$, such that

$$ \mathcal{T}_s (\mathcal{S}_{\Delta x} f) = \mathcal{S}_{\Delta x} (\mathcal{T}_s f) \tag{3} $$

holds for all $\Delta x$. The motivation behind this assumption is the basic requirement that the perception
of a visual object should be the same irrespective of its position in the image plane. Alternatively
stated, the operator $\mathcal{T}_s$ can be said to be homogeneous across space.¹



Convolution structure. Together, the assumptions of linearity and shift-invariance imply that the
internal representations $L(\cdot; s)$ are given by convolution transformations

$$ L(x; s) = (T(\cdot; s) * f)(x) = \int_{\xi \in \mathbb{R}^2} T(\xi; s) \, f(x - \xi) \, d\xi \tag{4} $$

where $T(\cdot; s)$ denotes some family of convolution kernels. Later, we will refer to these convolution
kernels as receptive fields.
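
As a minimal discrete sketch of the link between shift-invariance (3) and the convolution structure (4), the following Python code checks that a convolution operator commutes with the shift operator. The periodic 1-D domain, the three-tap kernel and the signal values are illustrative assumptions and are not taken from the paper.

```python
def circular_convolve(kernel, signal):
    """Discrete analogue of equation (4) on a periodic 1-D domain.

    kernel maps integer offsets xi to weights T(xi); signal is a list f.
    """
    n = len(signal)
    return [sum(w * signal[(x - xi) % n] for xi, w in kernel.items())
            for x in range(n)]


def shift(signal, dx):
    """Shift operator (S_dx f)(x) = f(x - dx), with periodic wrap-around."""
    n = len(signal)
    return [signal[(x - dx) % n] for x in range(n)]


# Hypothetical example data (not from the paper).
kernel = {-1: 0.25, 0: 0.5, 1: 0.25}
f = [0.0, 1.0, 4.0, 2.0, 0.0, 3.0, 5.0, 1.0]

lhs = circular_convolve(kernel, shift(f, 3))   # T_s (S_dx f)
rhs = shift(circular_convolve(kernel, f), 3)   # S_dx (T_s f)
shift_invariant = all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```

Under these assumptions `shift_invariant` evaluates to `True`, which is the discrete counterpart of equation (3) for an operator of the form (4).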



The issue of scale. A fundamental property of the convolution operation is that it may reflect 
different types of image structures depending on the spatial extent (the width) of the convolution 
kernel. 

• Convolution with a large support kernel will have the ability to respond to phenomena at
  coarse scales.

• A kernel with small support may on the other hand only capture phenomena at fine scales.

From this viewpoint it is natural to associate an interpretation of scale with the parameter $s$ and we
will assume that the limit case of the internal representations when $s$ tends to zero should correspond
to the original image pattern $f$

$$ \lim_{s \downarrow 0} L(\cdot; s) = \lim_{s \downarrow 0} \mathcal{T}_s f = f. \tag{5} $$



Semi-group structure. From the interpretation of $s$ as a scale parameter, it is natural to require
the image operators $\mathcal{T}_s$ to form a semi-group² over $s$

$$ \mathcal{T}_{s_1} \mathcal{T}_{s_2} = \mathcal{T}_{s_1 + s_2} \tag{6} $$

with a corresponding semi-group structure for the convolution kernels

$$ T(\cdot; s_1) * T(\cdot; s_2) = T(\cdot; s_1 + s_2). \tag{7} $$

Then, the transformation between any different and ordered scale levels $s_1$ and $s_2$ with $s_2 > s_1$ will
obey the cascade property

$$ L(\cdot; s_2) = T(\cdot; s_2 - s_1) * T(\cdot; s_1) * f = T(\cdot; s_2 - s_1) * L(\cdot; s_1) \tag{8} $$

i.e. a similar type of transformation as from the original data $f$. An image representation with these
properties is referred to as a multi-scale representation.
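
The semi-group property (7) can be checked numerically for a concrete kernel family. The sketch below assumes the Gaussian kernel that the text arrives at later, here in a 1-D form with the variance as scale parameter; the scale values, the sampling step and the helper names `gauss` and `convolve_at` are illustrative choices, not part of the paper.

```python
import math


def gauss(x, s):
    """1-D Gaussian kernel with variance s (the scale parameter)."""
    return math.exp(-x * x / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)


def convolve_at(kernel_a, kernel_b, x, half_width=12.0, dx=0.05):
    """Riemann-sum approximation of the convolution (kernel_a * kernel_b)(x)."""
    steps = int(round(2.0 * half_width / dx))
    total = 0.0
    for i in range(steps + 1):
        xi = -half_width + i * dx
        total += kernel_a(xi) * kernel_b(x - xi)
    return total * dx


s1, s2 = 1.0, 2.0   # hypothetical scale values

# Compare T(.; s1) * T(.; s2) with T(.; s1 + s2) at a few sample points.
max_error = max(
    abs(convolve_at(lambda u: gauss(u, s1), lambda u: gauss(u, s2), x)
        - gauss(x, s1 + s2))
    for x in [-3.0, -1.0, 0.0, 0.5, 2.0]
)
```

With these assumptions `max_error` stays at the level of numerical round-off. The cascade property (8) then follows from the associativity of convolution.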

1 For us humans and other higher mammals, the retina is obviously not translationally invariant. Instead, finer scale
receptive fields are concentrated to the fovea in such a way that the minimum receptive field size increases essentially
linearly with eccentricity. With respect to such a sensor space, the assumption about translational invariance should be
taken as an idealized model for the region in space where there are receptive fields above a certain size.

2 Concerning the parameterization of this semi-group, we will in the specific case of a one-dimensional (scalar) scale
parameter assume the parameter $s \in \mathbb{R}_+$ to have a direct interpretation of scale, whereas in the case of a multi-dimensional
parameter $s = (s_1, \dots, s_N) \in \mathbb{R}_+^N$, these parameters could also encode for other properties of the convolution kernels
in terms of the orientation in image space or the degree of elongation $e = \sigma_1/\sigma_2$, where $\sigma_1$ and $\sigma_2$ denote the spatial
extents in different directions. The convolution kernels will, however, not be required to form a semi-group over any
type of parameterization, such as the orientation or the elongation $e$. Instead, we will assume that there exists some parameterization
$s$ for which an additive linear semi-group structure can be defined and from which the latter types of parameters can then
be derived.

Self-similarity over scale. Regarding the family of convolution kernels used for computing a
multi-scale representation, it is also natural to require them to be self-similar over scale, such that if $s$
is a one-dimensional scale parameter then all the kernels correspond to rescaled copies
of some prototype kernel $\bar{T}$ for some transformation $\varphi(s)$³ of the scale parameter. If $s$ is a
multi-dimensional scale parameter, the requirement of self-similarity over scale can be generalized
into

$$ T(x; s) = \frac{1}{|\det \varphi(s)|} \, \bar{T}(\varphi(s)^{-1} x) \tag{10} $$

where $\varphi(s)$ now denotes a non-singular $2 \times 2$-dimensional matrix regarding a 2-D image domain
and $\varphi(s)^{-1}$ its inverse. With this definition, a multi-scale representation with a scalar scale parameter
$s \in \mathbb{R}_+$ will be based on uniform rescalings of the prototype kernel, whereas a multi-scale
representation based on a multi-dimensional scale parameter might also allow for rotations as well
as non-uniform affine deformations of the prototype kernel.



Infinitesimal generator. For theoretical analysis it is preferable if the scale parameter can be
treated as a continuous scale parameter and if image representations between adjacent levels of
scale can be related by partial differential equations. Such relations can be expressed if the semi-group
possesses an infinitesimal generator (Hille and Phillips, 1957)

$$ \mathcal{B} L = \lim_{h \downarrow 0} \frac{T(\cdot; h) * f - f}{h} \tag{11} $$

and implies that image representations between adjacent levels of scale can be related by differential
evolution equations; for a scalar scale parameter of the form

$$ \partial_s L(x; s) = (\mathcal{B} L)(x; s) \tag{12} $$

for some operator $\mathcal{B}$ and for an $N$-dimensional scale parameter of the form

$$ (\mathcal{D}_u L)(x; s) = (\mathcal{B}(u) L)(x; s) = (u_1 \mathcal{B}_1 + \dots + u_N \mathcal{B}_N) L(x; s) \tag{13} $$

for any positive direction $u = (u_1, \dots, u_N)$ in the parameter space with $u_i \geq 0$ for every $i$. In
(Lindeberg, 2011) it is shown how such differential relationships can be ensured given a proper
selection of functional spaces and sufficient regularity requirements over space $x$ and scale $s$ in
terms of Sobolev norms. We shall therefore henceforth regard the internal representations $L(\cdot; s)$
as differentiable with respect to both the image space and scale parameter(s).
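
To make the limit in equation (11) concrete, the following sketch estimates the infinitesimal generator by a finite difference for a small scale increment $h$. It assumes a 1-D Gaussian kernel with variance $s$, for which the generator is known to be half the second derivative; the test signal, the value of `h` and the helper names are hypothetical choices for illustration only.

```python
import math


def gauss(x, s):
    """1-D Gaussian kernel with variance s."""
    return math.exp(-x * x / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)


def smooth(f, x, s, n=400):
    """Riemann-sum approximation of L(x; s) = (g(.; s) * f)(x)."""
    sigma = math.sqrt(s)
    dx = 16.0 * sigma / n          # grid covering [-8 sigma, 8 sigma]
    total = 0.0
    for i in range(n + 1):
        xi = -8.0 * sigma + i * dx
        total += gauss(xi, s) * f(x - xi)
    return total * dx


omega = 2.0                        # hypothetical test frequency


def f(x):
    return math.cos(omega * x)


def half_second_derivative(x):
    """(1/2) f''(x) for f(x) = cos(omega x)."""
    return -0.5 * omega ** 2 * math.cos(omega * x)


h = 1e-3                           # small scale increment
x0 = 0.3
generator_estimate = (smooth(f, x0, h) - f(x0)) / h   # finite-h version of (11)
error = abs(generator_estimate - half_second_derivative(x0))
```

For this choice the difference quotient agrees with $\frac{1}{2} f''(x_0)$ up to a term of order $h$, which illustrates how the semi-group leads to an evolution equation of the form (12).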



3 The reason for introducing a function $\varphi$ for transforming the scale parameter $s$ into a scaling factor $\varphi(s)$ in image
space is that the requirement of a semi-group structure (6) does not imply any restriction on how the parameter $s$ should
be related to image measurements in dimensions of length. The semi-group structure only implies an abstract ordering
relation between coarser and finer scales $s_2 > s_1$ that could also be satisfied for any monotonically increasing
transformation of the parameter $s$. For the Gaussian scale-space concept with a scalar scale parameter and given by (22) this
transformation is given by $\sigma = \varphi(s) = \sqrt{s}$, whereas for the affine Gaussian scale-space concept given by (27) it is given
by the matrix square root function $\varphi(s) = \Sigma^{1/2}$, where $\Sigma$ denotes the covariance matrix that describes the spatial extent
and the orientation of the affine Gaussian kernels.

Non-enhancement of local extrema. For the internal representations $L(\cdot; s)$ that are computed
from the original image data $f$ it is in addition essential that the operators $\mathcal{T}_s$ do not generate new
structures in the representations at coarser scales that do not correspond to simplifications of
corresponding image structures in the original image data.

A particularly useful way of formalizing this requirement is that local extrema must not be
enhanced with increasing scale. In other words, if a point $(x_0; s_0)$ is a local (spatial) maximum
of the mapping $x \mapsto L(x; s_0)$ then the value must not increase with scale. Similarly, if a point
$(x_0; s_0)$ is a local (spatial) minimum of the mapping $x \mapsto L(x; s_0)$, then the value must not
decrease with scale. Given the above mentioned differentiability property with respect to scale, we
say that the multi-scale representation constitutes a scale-space representation if it for a scalar scale
parameter satisfies the following conditions:



$$ \partial_s L(x_0; s_0) \leq 0 \quad \text{at any non-degenerate local maximum,} \tag{14} $$
$$ \partial_s L(x_0; s_0) \geq 0 \quad \text{at any non-degenerate local minimum,} \tag{15} $$

or for a multi-parameter scale-space

$$ (\mathcal{D}_u L)(x_0; s_0) \leq 0 \quad \text{at any non-degenerate local maximum,} \tag{16} $$
$$ (\mathcal{D}_u L)(x_0; s_0) \geq 0 \quad \text{at any non-degenerate local minimum,} \tag{17} $$

for any positive direction $u = (u_1, \dots, u_N)$ in the parameter space with $u_i \geq 0$ for every $i$ (see
figure 1).



Figure 1: The requirement of non-enhancement of local extrema is a way of restricting the class of possible
image operations by formalizing the notion that new image structures must not be created with increasing 
scale, by requiring that the value at a local maximum must not increase and that the value at a local minimum 
must not decrease. 
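
The requirement illustrated in figure 1 can be checked on a simple example. The sketch below assumes 1-D Gaussian smoothing, under which a cosine of angular frequency $\omega$ is damped by the factor $e^{-\omega^2 s/2}$, so that $L$ and its scale derivative are available in closed form; the two-component test signal, the grid size and the scale value are hypothetical choices, not taken from the paper.

```python
import math

# Hypothetical test signal f(x) = cos(x) + 0.5 cos(3x),
# stored as (amplitude, angular frequency) pairs.
components = [(1.0, 1.0), (0.5, 3.0)]


def L(x, s):
    """Gaussian-smoothed signal: each cosine is damped by exp(-omega^2 s / 2)."""
    return sum(a * math.exp(-w * w * s / 2.0) * math.cos(w * x)
               for a, w in components)


def dL_ds(x, s):
    """Derivative of L with respect to the scale parameter s."""
    return sum(-0.5 * w * w * a * math.exp(-w * w * s / 2.0) * math.cos(w * x)
               for a, w in components)


n, s0 = 2000, 0.05
xs = [2.0 * math.pi * i / n for i in range(n)]
values = [L(x, s0) for x in xs]

# Locate local extrema over space on the periodic sampling grid.
maxima, minima = [], []
for i in range(n):
    left, mid, right = values[i - 1], values[i], values[(i + 1) % n]
    if mid > left and mid > right:
        maxima.append(xs[i])
    elif mid < left and mid < right:
        minima.append(xs[i])

maxima_ok = all(dL_ds(x, s0) <= 0.0 for x in maxima)   # condition (14)
minima_ok = all(dL_ds(x, s0) >= 0.0 for x in minima)   # condition (15)
```

In this example the values at the local maxima decrease and the values at the local minima increase as the scale grows, in line with conditions (14) and (15).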



Rotational invariance. If we restrict ourselves to a scale-space representation based on a scalar
(one-dimensional) scale parameter $s \in \mathbb{R}_+$, then it is natural to require the scale-space kernels to
be rotationally symmetric

$$ T(x; s) = h(\sqrt{x_1^2 + x_2^2}; s) \tag{18} $$

for some one-dimensional function $h(\cdot; s) \colon \mathbb{R} \to \mathbb{R}$. Such a symmetry requirement can be
motivated by the requirement that in the absence of further information, all spatial directions should be
equally treated (isotropy).

For a scale-space representation based on a multi-dimensional scale parameter, one may also
consider a weaker requirement of rotational invariance at the level of a family of kernels, for example
regarding a set of elongated kernels with different orientations in image space. Then, the family
of kernels may capture image data of different orientation in a rotationally invariant manner, for
example if all image orientations are explicitly represented or if the receptive fields corresponding
to different orientations in image space can be related by linear combinations.

Affine covariance. The perspective mapping from the 3-D world to the 2-D image space gives
rise to image deformations in the image domain. If we approximate the non-linear perspective
mapping from a surface pattern in the world to the image plane by a local linear transformation (the
derivative), then we can model this deformation by an affine transformation

$$ f' = \mathcal{A} f \quad \text{corresponding to} \quad f'(x') = f(x) \quad \text{with} \quad x' = A x + b. \tag{19} $$

To ensure that the internal representations behave nicely under image deformations, it is natural to
require a possibility of relating them under affine transformations

$$ L'(x'; s') = L(x; s) \quad \text{corresponding to} \quad \mathcal{T}_{A(s)} \, \mathcal{A} f = \mathcal{A} \, \mathcal{T}_s f \tag{20} $$

for some transformation $s' = A(s)$ of the scale parameter. Unfortunately, it turns out that affine
covariance cannot be achieved with a scalar scale parameter and linear operations. As will be shown
below, it can, however, be achieved with a 3-parameter linear scale-space.



2.2 Necessity result concerning spatial receptive fields 

Given the above mentioned requirements it can be shown that if we assume (i) linearity, (ii) shift-invariance
over space, (iii) semi-group property over scale, (iv) sufficient regularity properties over
space and scale and (v) non-enhancement of local extrema, then the scale-space representation over
a 2-D spatial domain must satisfy (Lindeberg 2011, theorem 5, page 42)

$$\partial_s L = \frac{1}{2} \nabla_x^T (\Sigma_0 \nabla_x L) - \delta_0^T \nabla_x L \tag{21}$$

for some 2×2 covariance matrix $\Sigma_0$ and some 2-D vector $\delta_0$ with $\nabla_x = (\partial_{x_1}, \partial_{x_2})^T$. If we in
addition require the convolution kernels to be mirror symmetric through the origin, $T(-x; s) = T(x; s)$,
then the offset vector $\delta_0$ must be zero. There are two special cases within this class of
operations that are particularly worth emphasizing.



Gaussian receptive fields. If we require the corresponding convolution kernels to be rotationally
symmetric, then it follows that they will be Gaussians

$$T(x; s) = g(x; s) = \frac{1}{2\pi s} e^{-x^T x/2s} = \frac{1}{2\pi s} e^{-(x_1^2 + x_2^2)/2s} \tag{22}$$

with corresponding Gaussian derivative operators

$$(\partial_{x^\alpha} g)(x; s) = (\partial_{x_1^{\alpha_1} x_2^{\alpha_2}} g)(x_1, x_2; s) = (\partial_{x_1^{\alpha_1}} g)(x_1; s) \, (\partial_{x_2^{\alpha_2}} g)(x_2; s) \tag{23}$$



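(with $\alpha = (\alpha_1, \alpha_2)$ where $\alpha_1$ and $\alpha_2$ denote the order of differentiation in the $x_1$- and $x_2$-directions,
respectively) as shown in figure 2, with the corresponding one-dimensional Gaussian kernel and its
Gaussian derivatives of the form

$$g(x_1; s) = \frac{1}{\sqrt{2\pi s}} e^{-x_1^2/2s} \tag{24}$$

$$g_{x_1}(x_1; s) = -\frac{x_1}{s} \, g(x_1; s) = -\frac{x_1}{\sqrt{2\pi} \, s^{3/2}} e^{-x_1^2/2s} \tag{25}$$

$$g_{x_1 x_1}(x_1; s) = \frac{x_1^2 - s}{s^2} \, g(x_1; s) = \frac{x_1^2 - s}{\sqrt{2\pi} \, s^{5/2}} e^{-x_1^2/2s} \tag{26}$$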



Such Gaussian functions have previously been used for modelling biological vision by Young (1987),
who has shown that there are receptive fields in the striate cortex that can be well modelled
by Gaussian derivatives up to order four. More generally, these Gaussian derivative operators can
be used as a general basis for expressing image operations such as feature detection, feature
classification, surface shape, image matching and image-based recognition (Witkin 1983,
Koenderink 1984, Koenderink and van Doorn 1992, Lindeberg 1994a,b, Florack 1997, ter Haar Romeny
2003, Lindeberg 2008); see specifically (Schiele and Crowley 2000, Linde and Lindeberg 2004,
Lowe 2004, Bay et al. 2008, Linde and Lindeberg 2012) for explicit approaches for object
recognition based on Gaussian receptive fields or approximations thereof.
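As a minimal numerical sketch of (22)-(26), the separable Gaussian derivative kernels can be sampled on a grid. The code assumes NumPy; the function names (`gauss1d`, `gauss1d_dx`, `gauss1d_dxx`, `gauss2d_derivative`), the grid and the value of s are illustrative choices that do not come from the original text.

```python
import numpy as np

def gauss1d(x, s):
    """1-D Gaussian with variance s, eq. (24)."""
    return np.exp(-x**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)

def gauss1d_dx(x, s):
    """First-order derivative of the 1-D Gaussian, eq. (25)."""
    return -x / s * gauss1d(x, s)

def gauss1d_dxx(x, s):
    """Second-order derivative of the 1-D Gaussian, eq. (26)."""
    return (x**2 - s) / s**2 * gauss1d(x, s)

# Lookup table indexed by the order of differentiation (0, 1 or 2).
_DERIVS = (gauss1d, gauss1d_dx, gauss1d_dxx)

def gauss2d_derivative(x1, x2, s, a1=0, a2=0):
    """Separable 2-D Gaussian derivative of orders (a1, a2), eq. (23).

    Only orders up to two are supported in this sketch.
    """
    return np.outer(_DERIVS[a1](x1, s), _DERIVS[a2](x2, s))

# Illustrative sampling grid and scale (assumed values).
x = np.linspace(-8.0, 8.0, 801)
dx = x[1] - x[0]
s = 1.5
kernel_xx = gauss2d_derivative(x, x, s, a1=2, a2=0)
```

The first array axis of `kernel_xx` corresponds to $x_1$ and the second to $x_2$. Because of separability, a 2-D derivative response can be computed by two successive 1-D convolutions instead of one 2-D convolution.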



Affine-adapted Gaussian receptive fields. If we relax the requirement of rotational symmetry
into the weaker requirement of mirror symmetry through the origin, then it follows that the
convolution kernels must instead be affine Gaussian kernels (Lindeberg 1994a)

$$T(x; s) = g(x; \Sigma) = \frac{1}{2\pi\sqrt{\det \Sigma}} e^{-x^T \Sigma^{-1} x/2} \tag{27}$$

Figure 2: Spatial receptive fields formed by the 2-D Gaussian kernel with its partial derivatives up to order
two. The corresponding family of receptive fields is closed under translations, rotations and scaling
transformations.



where $\Sigma$ denotes any symmetric positive semi-definite 2×2 matrix. This affine scale-space concept
is closed under affine transformations, meaning that if we for affine related images

$$f_L(\xi) = f_R(\eta) \quad \text{where} \quad \eta = A \xi + b \tag{28}$$

define corresponding scale-space representations according to

$$L(\cdot; \Sigma_L) = g(\cdot; \Sigma_L) * f_L(\cdot), \quad R(\cdot; \Sigma_R) = g(\cdot; \Sigma_R) * f_R(\cdot) \tag{29}$$

then these scale-space representations will be related according to (Lindeberg 1994a, Lindeberg
and Gårding 1997)

$$L(x; \Sigma_L) = R(y; \Sigma_R) \quad \text{where} \quad \Sigma_R = A \Sigma_L A^T \quad \text{and} \quad y = A x + b. \tag{30}$$

In other words, given that an image $f_L$ is affine transformed into an image $f_R$, it will always be
possible to find a transformation between the scale parameters $s_L$ and $s_R$ in the two domains that
makes it possible to match the corresponding derived internal representations $L(\cdot; s_L)$ and $R(\cdot; s_R)$.
Figure 3 shows a few examples of such kernels in different directions with the covariance matrix
parameterized according to



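$$\Sigma = \begin{pmatrix} \lambda_1 \cos^2\theta + \lambda_2 \sin^2\theta & (\lambda_1 - \lambda_2) \cos\theta \sin\theta \\ (\lambda_1 - \lambda_2) \cos\theta \sin\theta & \lambda_1 \sin^2\theta + \lambda_2 \cos^2\theta \end{pmatrix} \tag{31}$$

with $\lambda_1$ and $\lambda_2$ denoting the eigenvalues and $\theta$ the orientation. Directional derivatives of these
kernels can in turn be obtained from linear combinations of partial derivative operators according
to

$$\partial_\varphi^m L = (\cos\varphi \, \partial_{x_1} + \sin\varphi \, \partial_{x_2})^m L = \sum_{k=0}^{m} \binom{m}{k} \cos^k\varphi \, \sin^{m-k}\varphi \, L_{x_1^k x_2^{m-k}}. \tag{32}$$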

With respect to biological vision, these kernels can be used for modelling receptive fields that are
oriented in the spatial domain, as will be described in connection with equation (57) in section 4.
For computer vision they can be used for computing affine invariant image descriptors for e.g.
cues to surface shape, image-based matching and recognition (Lindeberg 1994a, Lindeberg and
Gårding 1997, Baumberg 2000, Mikolajczyk and Schmid 2004, Tuytelaars and van Gool 2004,
Lazebnik et al. 2005, Rothganger et al. 2006).
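The closedness property (28)-(30) can be checked numerically at the level of the kernels themselves. The following sketch assumes NumPy; the helper names `covariance` and `affine_gauss`, the matrix `A` and the sample points are hypothetical illustration values. It verifies that $g(x; \Sigma_L) = |\det A| \, g(Ax; A \Sigma_L A^T)$, which is the kernel-level identity behind (30), with the factor $|\det A|$ arising from the change of variables in the convolution integral.

```python
import numpy as np

def covariance(lam1, lam2, theta):
    """Covariance matrix parameterized by eigenvalues and orientation, eq. (31)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[lam1 * c**2 + lam2 * s**2, (lam1 - lam2) * c * s],
                     [(lam1 - lam2) * c * s, lam1 * s**2 + lam2 * c**2]])

def affine_gauss(x, Sigma):
    """Affine Gaussian kernel of eq. (27); x has shape (..., 2)."""
    # Quadratic form x^T Sigma^{-1} x evaluated for every point.
    q = np.einsum('...i,ij,...j->...', x, np.linalg.inv(Sigma), x)
    return np.exp(-q / 2.0) / (2.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))

Sigma_L = covariance(4.0, 1.0, np.pi / 6)      # assumed example values
A = np.array([[1.2, 0.4], [-0.3, 0.9]])        # arbitrary invertible matrix
Sigma_R = A @ Sigma_L @ A.T                    # transformation rule in eq. (30)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 2))                   # a few test points x
lhs = affine_gauss(xs, Sigma_L)
rhs = abs(np.linalg.det(A)) * affine_gauss(xs @ A.T, Sigma_R)  # rows of xs @ A.T are A x
```

With `lhs` and `rhs` agreeing to numerical precision, a receptive field with covariance matrix $\Sigma_L$ in the first image domain has an exactly matching counterpart with covariance matrix $A \Sigma_L A^T$ in the second.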




Figure 3: Spatial receptive fields formed by affine Gaussian kernels and directional derivatives of these. The 
corresponding family of receptive fields is closed under general affine transformations of the spatial domain, 
including translations, rotations, scaling transformations and perspective foreshortening. 



Note on receptive fields formed from derivatives of the convolution kernels. Due to the linearity
of the differential equation (21), which has been derived by necessity from the structural
requirements, it follows that the result of applying a linear operator $\mathcal{D}$ to the solution $L$ will
also satisfy the differential equation, however, with a different initial condition

$$\lim_{s \downarrow 0} (\mathcal{D} L)(\cdot; s) = \mathcal{D} f. \tag{33}$$

The result of applying a linear operator $\mathcal{D}$ to the scale-space representation $L$ will therefore satisfy
the above mentioned structural requirements of linearity, shift invariance, the weaker form of
rotational invariance at the group level and non-enhancement of local extrema, with the semi-group
structure (6) replaced by the cascade property

$$(\mathcal{D} L)(\cdot; s_2) = T(\cdot; s_2 - s_1) * (\mathcal{D} L)(\cdot; s_1). \tag{34}$$

Then, one may ask whether any linear operator $\mathcal{D}$ would be reasonable. From the requirement of scale
invariance, however, it follows that the operator $\mathcal{D}$ must not be allowed to have non-infinitesimal
support, since a non-infinitesimal support $s_0 > 0$ would violate the requirement of self-similarity
over scale and it would not be possible to perform image measurements at a scale level lower
than $s_0$. Thus, any receptive field operator derived from the scale-space representation in a manner
compatible with the structural arguments must correspond to local derivatives. In the illustrations
above, partial derivatives and directional derivatives up to order two have been shown.
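The cascade property (34) can be illustrated in one dimension by taking $\mathcal{D} = \partial_x$ and a unit impulse as input, so that $(\mathcal{D} L)(\cdot; s)$ is the first-order Gaussian derivative kernel. The sketch below assumes NumPy and approximates the continuous convolution by a Riemann sum on a sampled grid; the grid, the scale values and the names `gauss`, `gauss_dx`, `cascade` and `direct` are illustrative assumptions.

```python
import numpy as np

def gauss(x, s):
    """1-D Gaussian kernel with variance s."""
    return np.exp(-x**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)

def gauss_dx(x, s):
    """First-order derivative of the 1-D Gaussian kernel."""
    return -x / s * gauss(x, s)

x = np.linspace(-12.0, 12.0, 1201)    # symmetric grid with an odd number of samples
dx = x[1] - x[0]
s1, s2 = 1.0, 2.5                     # assumed scale levels with s2 > s1

# Right-hand side of (34): smooth the derivative response at scale s1
# with the kernel T(.; s2 - s1); the factor dx turns the sum into an integral.
cascade = np.convolve(gauss(x, s2 - s1), gauss_dx(x, s1), mode="same") * dx

# Left-hand side of (34): the derivative response computed directly at scale s2.
direct = gauss_dx(x, s2)

# Compare away from the grid boundaries, where truncation of the kernels matters.
center = np.abs(x) <= 6.0
max_err = np.max(np.abs(cascade[center] - direct[center]))
```

In this setting `max_err` is at the level of floating-point rounding, which reflects that derivative responses at a coarser scale can be obtained from those at a finer scale without returning to the original signal.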

For directional derivatives that have been derived from elongated kernels whose underlying
zero-order convolution kernels are not rotationally symmetric, it should be noted that we have
aligned the directions of the directional derivative operators to the orientations of the underlying
kernels. A structural motivation for making such an alignment can be obtained from a requirement
of a weaker form of rotational symmetry at the group level. If we would like the family of receptive
fields to be rotationally symmetric as a group, then it is natural to require the directional derivative
operators to be transformed in a similar way as the underlying kernels.

Receptive fields in terms of derivatives of the convolution kernels derived by necessity also
have additional advantages if one adds a further structural requirement of invariance under additive
intensity transformations $f(x) \mapsto f(x) + C$. A zero-order receptive field will be affected by
such an intensity transformation, whereas higher-order derivatives are invariant under additive
intensity transformations. As will be described in section 6, this form of invariance has a particularly
interesting physical interpretation with regard to a logarithmic intensity scale.

2.3 Spatio-temporal image data

For spatio-temporal image data $f(x, t)$ defined on a 2+1-D spatio-temporal domain with $(x, t) =
(x_1, x_2, t)$, it is natural to inherit the symmetry requirements over the spatial domain. In addition,
the following structural requirements can be imposed, motivated by the special nature of time and
space-time.

Galilean covariance. For time-dependent spatio-temporal image data, we may have relative
motions between objects in the world and the observer, where a constant velocity translational motion
can be modelled by a Galilean transformation

$$f' = \mathcal{G}_v f \quad \text{corresponding to} \quad f'(x', t') = f(x, t) \quad \text{with} \quad x' = x + v t. \tag{35}$$

To enable a consistent visual interpretation under different relative motions, it is natural to require
that it should be possible to transform internal representations $L(\cdot, \cdot; s)$ that are computed from
spatio-temporal image data under different relative motions

$$L'(x', t'; s') = L(x, t; s) \quad \text{corresponding to} \quad \mathcal{T}_{G_v(s)} \mathcal{G}_v f = \mathcal{G}_v \mathcal{T}_s f. \tag{36}$$

Such a property is referred to as Galilean covariance.

Temporal causality. For a vision system that interacts with the environment in a real-time
setting, a fundamental constraint on the convolution kernels (the spatio-temporal receptive fields)
is that they cannot access data from the future. Hence, they must be time-causal in the sense that
the convolution kernel must be zero for any relative time moment that would imply access to the future:

$$T(x, t; s) = 0 \quad \text{if} \quad t < 0. \tag{37}$$

Time-recursivity. Another fundamental constraint on a real-time system is that it cannot keep a
record of everything that has happened in the past. Hence, the computations must be based on a
limited internal temporal buffer $M(x, t)$, which should provide:

• a sufficient record of past information and

• sufficient information to update its internal state when new information arrives.

A particularly useful solution is to let the internal representations $L$ at different temporal scales
also serve as the memory buffer of the past. In (Lindeberg 2011, section 5.1.3, page 57) it is shown
that such a requirement can be formalized by a time-recursive updating rule of the form

$$L(x, t_2; s_2, \tau) = \int\!\!\int U(x - \xi, t_2 - t_1; s_2 - s_1, \tau, \zeta) \, L(\xi, t_1; s_1, \zeta) \, d\zeta \, d\xi + \int\!\!\int B(x - \xi, t_2 - u; s_2, \tau) \, f(\xi, u) \, d\xi \, du$$

which is required to hold for any pair of scale levels $s_2 > s_1$ and any two time moments $t_2 > t_1$,
where

• the kernel $U$ updates the internal state,

• the kernel $B$ incorporates new image data into the representation,

• $\tau$ is the temporal scale and $\zeta$ an integration variable referring to internal temporal buffers at
different temporal scales.

Non-enhancement of local extrema in a time-recursive setting. For a time-recursive spatio-temporal
visual front-end it is also natural to generalize the notion of non-enhancement of local
extrema, such that it is required to hold both with respect to increasing spatial scales $s$ and evolution
over time $t$. Thus, if at some spatial scale $s_0$ and time moment $t_0$ a point $(x_0, \tau_0)$ is a local maximum
(minimum) for the mapping

$$(x, \tau) \mapsto L(x, t_0; s_0, \tau) \tag{38}$$

then for every positive direction $u = (u_1, \ldots, u_N, u_{N+1})$ in the $N+1$-dimensional space spanned
by $(s, t)$, the directional derivative $(\mathcal{D}_u L)(x, t; s, \tau)$ must satisfy

$$(\mathcal{D}_u L)(x_0, t_0; s_0, \tau_0) \leq 0 \quad \text{at any local maximum,} \tag{39}$$

$$(\mathcal{D}_u L)(x_0, t_0; s_0, \tau_0) \geq 0 \quad \text{at any local minimum.} \tag{40}$$



3 Necessity results concerning spatio-temporal receptive fields 

We shall now describe how these structural requirements restrict the class of possible spatio-temporal 
receptive fields. 



3.1 Non-causal spatio-temporal receptive fields 

If one disregards the requirements of temporal causality and time recursivity and instead requires
(i) linearity, (ii) shift invariance over space and time, (iii) semi-group property over spatial and
temporal scales, (iv) sufficient regularity properties over space, time and spatio-temporal scales
and (v) non-enhancement of local extrema for a multi-parameter scale-space, then it follows from
(Lindeberg 2011, theorem 5, page 42) that the scale-space representation over a 2+1-D
spatio-temporal domain must satisfy

$$\partial_s L = \frac{1}{2} \nabla_{(x,t)}^T (\Sigma_0 \nabla_{(x,t)} L) - \delta_0^T \nabla_{(x,t)} L \tag{41}$$

for some 3×3 covariance matrix $\Sigma_0$ and some 3-D vector $\delta_0$ with $\nabla_{(x,t)} = (\partial_{x_1}, \partial_{x_2}, \partial_t)^T$.

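In terms of convolution kernels, the zero-order receptive fields will then be spatio-temporal
Gaussian kernels

with $p = (x, t)^T = (x_1, x_2, t)^T$,

$$\Sigma_s = \begin{pmatrix} \lambda_1 \cos^2\theta + \lambda_2 \sin^2\theta + v_1^2 \lambda_t & (\lambda_2 - \lambda_1) \cos\theta \sin\theta + v_1 v_2 \lambda_t & v_1 \lambda_t \\ (\lambda_2 - \lambda_1) \cos\theta \sin\theta + v_1 v_2 \lambda_t & \lambda_1 \sin^2\theta + \lambda_2 \cos^2\theta + v_2^2 \lambda_t & v_2 \lambda_t \\ v_1 \lambda_t & v_2 \lambda_t & \lambda_t \end{pmatrix} \tag{43}$$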

(44) 




where (i) $\lambda_1$, $\lambda_2$ and $\theta$ determine the spatial extent, (ii) $\lambda_t$ determines the temporal extent, (iii) $v =
(v_1, v_2)$ denotes the image velocity and (iv) $\delta$ represents a temporal delay. From the corresponding
Gaussian spatio-temporal scale-space

$$L(x, t; \Sigma_{space}, v, \tau) = (g(\cdot, \cdot; \Sigma_{space}, v, \tau) * f(\cdot, \cdot))(x, t) \tag{45}$$

spatio-temporal derivatives can then be defined according to

$$L_{x^\alpha t^\beta}(x, t; \Sigma_{space}, v, \tau) = (\partial_{x^\alpha t^\beta} L)(x, t; \Sigma_{space}, v, \tau) \tag{46}$$

Figure 4: Non-causal and space-time separable spatio-temporal receptive fields over 1+1-D space-time as
generated by the Gaussian spatio-temporal scale-space model with $v = 0$. This family of receptive fields
is closed under rescalings of the spatial and temporal dimensions. (Horizontal axis: space $x$. Vertical axis:
time $t$.)





Figure 5: Non-causal and velocity-adapted spatio-temporal receptive fields over 1+1-D space-time as
generated by the Gaussian spatio-temporal scale-space model for a non-zero image velocity $v$. This family of
receptive fields is closed under rescalings of the spatial and temporal dimensions as well as Galilean
transformations. (Horizontal axis: space $x$. Vertical axis: time $t$.)



with corresponding velocity-adapted temporal derivatives

$$\partial_{\bar{t}} = v^T \nabla_x + \partial_t = v_1 \partial_{x_1} + v_2 \partial_{x_2} + \partial_t \tag{47}$$

as illustrated in figure 4 and figure 5 for the case of a 1+1-D space-time.

Motivated by the requirement of Galilean covariance, it is natural to align the directions $v$ in
space-time for which these velocity-adapted spatio-temporal derivatives are computed to the velocity
values used in the underlying zero-order spatio-temporal kernels, since the resulting
velocity-adapted spatio-temporal derivatives will then be Galilean covariant. Such receptive fields can be
used for modelling spatio-temporal receptive fields in biological vision (Lindeberg 1997, Young
et al. 2001, Young and Lesperance 2001, Lindeberg 2011) and for computing spatio-temporal image
features and Galilean invariant image descriptors for spatio-temporal recognition in computer
vision (Laptev and Lindeberg 2003, 2004a,b, Laptev et al. 2007, Willems et al. 2008).

Transformation property under Galilean transformations. Under a Galilean transformation
of space-time (35), in matrix form written

$$p' = G_v p \quad \text{corresponding to} \quad \begin{pmatrix} x_1' \\ x_2' \\ t' \end{pmatrix} = \begin{pmatrix} 1 & 0 & v_1 \\ 0 & 1 & v_2 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ t \end{pmatrix}, \tag{48}$$

the corresponding Gaussian spatio-temporal representations are related in an algebraically similar
way (28)-(30) as the affine Gaussian scale-space, with the affine transformation matrix $A$ replaced by
a Galilean transformation matrix $G_v$. In other words, if two spatio-temporal image patterns $f_L$ and
$f_R$ are related by a Galilean transformation encompassing a translation $\Delta p = (\Delta x_1, \Delta x_2, \Delta t)$ in
space-time

$$f_L(\xi) = f_R(\eta) \quad \text{where} \quad \eta = G_v \xi + \Delta p \tag{49}$$

and if corresponding spatio-temporal scale-space representations are defined according to

$$L(\cdot; \Sigma_L) = g(\cdot; \Sigma_L) * f_L(\cdot), \quad R(\cdot; \Sigma_R) = g(\cdot; \Sigma_R) * f_R(\cdot) \tag{50}$$

for general spatio-temporal covariance matrices $\Sigma_L$ and $\Sigma_R$ of the form (43), then these
spatio-temporal scale-space representations will be related according to

$$L(x; \Sigma_L) = R(y; \Sigma_R) \quad \text{where} \quad \Sigma_R = G_v \Sigma_L G_v^T \quad \text{and} \quad y = G_v x + \Delta p. \tag{51}$$

Given two spatio-temporal image patterns that are related by a Galilean transformation, such as
arising when an object is observed with different relative motion between the object and the viewing
direction of the observer, it will therefore be possible to perfectly match the spatio-temporal
receptive field responses computed from the different spatio-temporal image patterns. Such a perfect
matching would, however, not be possible without velocity adaptation, i.e., if the spatio-temporal
receptive fields were computed using space-time separable receptive fields only.
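The relation between the covariance matrix (43) and the Galilean transformation matrix in (48) can be checked with a few lines of linear algebra. The sketch below assumes NumPy; the function names `spatial_cov`, `spatiotemporal_cov` and `galilean` and all parameter values are hypothetical. It verifies that a velocity-adapted covariance matrix of the form (43) equals $G_v \Sigma G_v^T$ for the space-time separable matrix $\Sigma$ obtained with $v = 0$, using the transformation rule for covariance matrices by analogy with (30).

```python
import numpy as np

def spatial_cov(lam1, lam2, theta):
    """Upper-left 2x2 block of eq. (43) for v = 0 (note the sign convention used there)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[lam1 * c**2 + lam2 * s**2, (lam2 - lam1) * c * s],
                     [(lam2 - lam1) * c * s, lam1 * s**2 + lam2 * c**2]])

def spatiotemporal_cov(lam1, lam2, theta, lam_t, v):
    """3x3 spatio-temporal covariance matrix written out term by term as in eq. (43)."""
    v = np.asarray(v, dtype=float)
    S = np.zeros((3, 3))
    S[:2, :2] = spatial_cov(lam1, lam2, theta) + lam_t * np.outer(v, v)
    S[:2, 2] = S[2, :2] = lam_t * v
    S[2, 2] = lam_t
    return S

def galilean(v):
    """Galilean transformation matrix G_v of eq. (48)."""
    G = np.eye(3)
    G[:2, 2] = v
    return G

params = dict(lam1=3.0, lam2=1.0, theta=0.4, lam_t=2.0)   # assumed example values
v = np.array([0.5, -1.5])
u = np.array([-0.2, 0.7])

Sigma_static = spatiotemporal_cov(v=np.zeros(2), **params)  # space-time separable case
Sigma_moving = spatiotemporal_cov(v=v, **params)            # velocity-adapted case
transformed = galilean(v) @ Sigma_static @ galilean(v).T
```

Since `transformed` agrees with `Sigma_moving`, the velocity-adapted kernels are exactly the Galilean-transformed versions of the space-time separable ones, which is the algebraic reason why the family is closed under Galilean transformations.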



3.2 Time-causal spatio-temporal receptive fields 

If we on the other hand, with regard to real-time biological vision, want to respect both temporal
causality and temporal recursivity, we obtain a different family of time-causal spatio-temporal
receptive fields. Given the requirements of (i) linearity, (ii) shift invariance over space and time,
(iii) temporal causality, (iv) time-recursivity, (v) semi-group property over spatial scales $s$ and time $t$,
$T_{s_1, t_1} T_{s_2, t_2} = T_{s_1 + s_2, t_1 + t_2}$, (vi) sufficient regularity properties over space, time and spatio-temporal
scales and (vii) non-enhancement of local extrema in a time-recursive setting (see footnote 4), then it follows that
the time-causal spatio-temporal scale-space must satisfy the system of diffusion equations
(Lindeberg 2011, equations (88-89), page 52, theorem 17, page 78)

$$\partial_s L = \frac{1}{2} \nabla_x^T (\Sigma \nabla_x L) \tag{52}$$

$$\partial_t L = -v^T \nabla_x L + \frac{1}{2} \partial_{\tau\tau} L \tag{53}$$



4 Concerning the relations between the non-causal spatio-temporal model in section 3.1 and the time-causal model in
section 3.2, please note that the requirement of non-enhancement of local extrema is formulated in different ways in the two
cases: (i) For the non-causal scale-space model, the non-enhancement condition is based on points that
are local extrema with respect to both space $x$ and time $t$. At such points, a sign condition is imposed on the derivative
in any positive direction over spatial scales $s$ and temporal scale $\tau$. (ii) For the time-causal scale-space model, the notion
of local extrema is based on points that are local extrema with respect to space $x$ and the internal temporal buffers at
different temporal scales $\tau$. At such points, a sign condition is imposed on the derivatives in the parameter space defined
by the spatial scale parameters $s$ and time $t$. Thus, in addition to the restriction to time-causal convolution kernels (37),
the derivation of the time-causal scale-space model is also based on different structural requirements.

for some 2×2 spatial covariance matrix $\Sigma$ and some image velocity $v$, with $s$ denoting the spatial
scale and $\tau$ the temporal scale. In terms of receptive fields, this spatio-temporal scale-space can be
computed by convolution kernels of the form

$$h(x, t; s, \tau; \Sigma, v) = g(x - vt; s; \Sigma) \, \phi(t; \tau) = \frac{1}{2\pi s \sqrt{\det \Sigma}} e^{-(x - vt)^T \Sigma^{-1} (x - vt)/2s} \, \frac{1}{\sqrt{2\pi} \, t^{3/2}} \, \tau \, e^{-\tau^2/2t} \tag{54}$$

where

• $g(x - vt; s; \Sigma)$ is a velocity-adapted 2-D affine Gaussian kernel with covariance matrix $\Sigma$
and

• $\phi(t; \tau)$ is a time-causal smoothing kernel over time with temporal scale parameter $\tau$.
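To make the temporal factor in (54) concrete, the following sketch samples the time-causal kernel $\phi(t; \tau)$ and a 1+1-D version of the velocity-adapted kernel. It assumes NumPy; the function names, the grid and the parameter values are illustrative, and the spatial part is simplified to a 1-D Gaussian with variance $s$ (corresponding to $\Sigma = 1$). The accumulated mass of $\phi$ up to time $T$ is compared against $\mathrm{erfc}(\tau/\sqrt{2T})$, which is the closed-form cumulative integral of this kernel (it has the form of a Lévy distribution density with scale $\tau^2$).

```python
import math
import numpy as np

def time_causal_kernel(t, tau):
    """phi(t; tau) = tau / (sqrt(2 pi) t^(3/2)) exp(-tau^2 / (2 t)) for t > 0, zero otherwise."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    pos = t > 0                      # temporal causality: no support for t <= 0, cf. eq. (37)
    tp = t[pos]
    out[pos] = tau / (np.sqrt(2.0 * np.pi) * tp**1.5) * np.exp(-tau**2 / (2.0 * tp))
    return out

def velocity_adapted_kernel(x, t, s, tau, v):
    """1+1-D simplification of eq. (54): Gaussian in (x - v t) times phi(t; tau)."""
    g = np.exp(-(x - v * t)**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)
    return g * time_causal_kernel(t, tau)

tau = 1.0                                        # assumed temporal scale
T = 50.0                                         # upper integration limit
t = np.linspace(-5.0, T, 220001)
phi = time_causal_kernel(t, tau)
mass = np.sum((phi[1:] + phi[:-1]) * np.diff(t)) / 2.0   # trapezoidal rule

# Spatial profile at a fixed time moment t = 4 for velocity v = 0.5.
xs = np.linspace(-10.0, 10.0, 2001)
profile = velocity_adapted_kernel(xs, np.full_like(xs, 4.0), s=1.0, tau=tau, v=0.5)
peak_position = xs[np.argmax(profile)]
```

The kernel has a heavy tail proportional to $t^{-3/2}$, so its mass approaches one only slowly as $T$ grows. The spatial peak at a given time moment is located at $x = vt$, which illustrates the velocity adaptation.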




Figure 6: Time-causal and space-time separable spatio-temporal receptive fields over a 1+1-D space-time as
generated by the time-causal spatio-temporal scale-space model with $v = 0$. This family of receptive fields
is closed under rescalings of the spatial and temporal dimensions. (Horizontal axis: space $x$. Vertical axis:
time $t$.)

Figure 7: Time-causal and velocity-adapted spatio-temporal receptive fields over a 1+1-D space-time as
generated by the time-causal spatio-temporal scale-space model with $v \neq 0$. This family of receptive fields is
closed under rescalings of the spatial and temporal dimensions as well as Galilean transformations.
(Horizontal axis: space $x$. Vertical axis: time $t$.)






From these kernels, spatio-temporal partial derivatives and velocity-adapted derivatives can be
computed in a corresponding manner, (46) and (47), as for the Gaussian spatio-temporal scale-space
concept; see figure 6 and figure 7 for illustrations in the case of a 1+1-D space-time.



4 Computational modelling of biological receptive fields 

An attractive property of the presented framework for early receptive fields is that it generates
receptive field profiles in good agreement with receptive field profiles found by cell recordings in the
retina, LGN and V1 of higher mammals. DeAngelis et al. (1995) and DeAngelis and Anzai (2004)
present overviews of receptive fields in the joint space-time domain. As outlined in (Lindeberg
2011, section 6), the Gaussian and time-causal scale-space concepts presented here can be used for
generating predictions of receptive field profiles that are qualitatively very similar to all the spatial
and spatio-temporal receptive fields presented in these surveys.

4.1 LGN neurons 

In the LGN, most cells (i) have approximately circular center-surround receptive fields and (ii) most
receptive fields are space-time separable (DeAngelis et al. 1995, DeAngelis and Anzai 2004). A corresponding
idealized scale-space model for such receptive fields can be expressed as

$$h_{LGN}(x_1, x_2, t; s, \tau) = \pm (\partial_{x_1 x_1} + \partial_{x_2 x_2}) \, g(x_1, x_2; s) \, \partial_{t'^n} h(t; \tau) \tag{55}$$

where

• $\pm$ determines the polarity (on-center/off-surround vs. off-center/on-surround),

• $\partial_{x_1 x_1} + \partial_{x_2 x_2}$ denotes the spatial Laplacian operator,

• $g(x_1, x_2; s)$ denotes a rotationally symmetric spatial Gaussian,

• $\partial_{t'}$ denotes a temporal derivative operator with respect to a possibly self-similar transformation
of time $t' = t^\alpha$ or $t' = \log t$ such that $\partial_{t'} = t^\kappa \partial_t$ for some constant $\kappa$ (Lindeberg 2011,
section 5.1, pages 59-61),

• $h(t; \tau)$ is a temporal smoothing kernel over time corresponding to the time-causal smoothing
kernel $\phi(t; \tau) = \frac{1}{\sqrt{2\pi} \, t^{3/2}} \, \tau \, e^{-\tau^2/2t}$ in (54) or a non-causal time-shifted Gaussian kernel
$g(t; \tau, \delta) = \frac{1}{\sqrt{2\pi\tau}} e^{-(t - \delta)^2/2\tau}$ according to (42),

• $n$ is the order of temporal differentiation,

• $s$ is the spatial scale parameter and

• $\tau$ is the temporal scale parameter.

Figure 8 shows a comparison between the spatial component of a receptive field in the LGN with a
Laplacian of the Gaussian. This model can also be used for modelling on-center/off-surround and
off-center/on-surround receptive fields in the retina.

Regarding the spatial domain, the model in terms of spatial Laplacians of Gaussians $(\partial_{x_1 x_1} +
\partial_{x_2 x_2}) \, g(x_1, x_2; s)$ is closely related to differences of Gaussians, which have previously been
shown to be good approximations of the spatial variation of receptive fields in the retina and the
LGN (Rodieck 1965). This property follows from the fact that the rotationally symmetric Gaussian
satisfies the isotropic diffusion equation

$$\frac{1}{2} \nabla^2 L(x; t) = \partial_t L(x; t) \approx \frac{L(x; t + \Delta t) - L(x; t)}{\Delta t} = \frac{DOG(x; t, \Delta t)}{\Delta t} \tag{56}$$

which implies that differences of Gaussians can be interpreted as approximations of derivatives over
scale and hence of Laplacian responses.

Figure 8: (left) Receptive fields in the LGN have approximately circular center-surround responses in the
spatial domain, as reported by DeAngelis et al. (1995). (right) In terms of Gaussian derivatives, this spatial
response profile can be modelled by the Laplacian of the Gaussian $\nabla^2 g(x, y; s) = (x^2 + y^2 -
2s)/(2\pi s^3) \exp(-(x^2 + y^2)/2s)$, here with $s = 0.4$ deg$^2$.
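The relation (56) between the Laplacian of the Gaussian and a difference of Gaussians can be checked numerically. In the sketch below, which assumes NumPy, the scale parameter is written $s$ rather than $t$, the value $s = 0.4$ deg$^2$ is taken from the figure caption, and the scale step `ds`, the grid and the function names are illustrative assumptions.

```python
import numpy as np

def gauss2d(x, y, s):
    """Rotationally symmetric 2-D Gaussian with scale parameter s, eq. (22)."""
    return np.exp(-(x**2 + y**2) / (2.0 * s)) / (2.0 * np.pi * s)

def laplacian_of_gaussian(x, y, s):
    """Closed form (x^2 + y^2 - 2 s) / (2 pi s^3) exp(-(x^2 + y^2) / (2 s))."""
    r2 = x**2 + y**2
    return (r2 - 2.0 * s) / (2.0 * np.pi * s**3) * np.exp(-r2 / (2.0 * s))

s = 0.4                      # spatial scale in deg^2, as in the LGN example
ds = 1e-5                    # assumed small step in scale
ax = np.linspace(-3.0, 3.0, 121)
X, Y = np.meshgrid(ax, ax)

# Left-hand side of (56): half the Laplacian of the Gaussian.
half_laplacian = 0.5 * laplacian_of_gaussian(X, Y, s)

# Right-hand side of (56): difference of Gaussians divided by the scale step.
dog = (gauss2d(X, Y, s + ds) - gauss2d(X, Y, s)) / ds

max_err = np.max(np.abs(half_laplacian - dog))
```

The remaining discrepancy `max_err` is of the order of `ds` and shrinks as the scale step is reduced, in line with the interpretation of the difference of Gaussians as a finite-difference approximation of the derivative over scale. The centre value of `half_laplacian` is negative, so the sign $\pm$ in (55) selects between on-center and off-center profiles.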



4.2 Simple cells in V1

In V1 the receptive fields are generally different from the receptive fields in the LGN in the sense
that they are (i) oriented in the spatial domain and (ii) sensitive to specific stimulus velocities
(DeAngelis et al. 1995, DeAngelis and Anzai 2004).



4.2.1 Spatial dependencies 

We can express a scale-space model for the spatial component of this orientation dependency
according to

$$h_{space}(x_1, x_2; s) = \partial_\varphi^m g(x_1, x_2; \Sigma) = (\cos\varphi \, \partial_{x_1} + \sin\varphi \, \partial_{x_2})^m g(x_1, x_2; \Sigma) \tag{57}$$

where

• $\partial_\varphi = \cos\varphi \, \partial_{x_1} + \sin\varphi \, \partial_{x_2}$ is a directional derivative operator,

• $m$ is the order of spatial differentiation and

• $g(x_1, x_2; \Sigma)$ is an affine Gaussian kernel with spatial covariance matrix $\Sigma$ as can be
parameterized according to (31)

where the direction $\varphi$ of the directional derivative operator should preferably be aligned to the
orientation $\theta$ of one of the eigenvectors of $\Sigma$.
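For the first-order case $m = 1$, the model (57) can be evaluated in closed form from the gradient of the affine Gaussian kernel, $\nabla g = -\Sigma^{-1} x \, g$. The following sketch assumes NumPy; the function names, the grid, the orientation and the eigenvalues (chosen here as 0.2 and 2 deg$^2$ for illustration) are assumptions. The direction $\varphi$ is aligned with the eigenvector of $\Sigma$ that corresponds to $\lambda_1$.

```python
import numpy as np

def covariance(lam1, lam2, theta):
    """Covariance matrix parameterized as in eq. (31)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[lam1 * c**2 + lam2 * s**2, (lam1 - lam2) * c * s],
                     [(lam1 - lam2) * c * s, lam1 * s**2 + lam2 * c**2]])

def affine_gauss(X1, X2, Sigma):
    """Affine Gaussian kernel of eq. (27) evaluated on coordinate arrays."""
    P = np.linalg.inv(Sigma)
    q = P[0, 0] * X1**2 + 2.0 * P[0, 1] * X1 * X2 + P[1, 1] * X2**2
    return np.exp(-q / 2.0) / (2.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))

def directional_derivative(X1, X2, Sigma, phi):
    """First-order (m = 1) directional derivative kernel of eq. (57)."""
    P = np.linalg.inv(Sigma)
    g = affine_gauss(X1, X2, Sigma)
    g_x1 = -(P[0, 0] * X1 + P[0, 1] * X2) * g   # partial derivative in the x1-direction
    g_x2 = -(P[1, 0] * X1 + P[1, 1] * X2) * g   # partial derivative in the x2-direction
    return np.cos(phi) * g_x1 + np.sin(phi) * g_x2

theta = np.pi / 3                     # assumed orientation
Sigma = covariance(0.2, 2.0, theta)   # elongated kernel (assumed eigenvalues)
phi = theta                           # derivative direction aligned with an eigenvector

ax = np.linspace(-4.0, 4.0, 161)
X1, X2 = np.meshgrid(ax, ax, indexing="ij")
kernel = directional_derivative(X1, X2, Sigma, phi)
```

The resulting `kernel` is odd under reflection through the origin and has one positive and one negative lobe separated along the direction $\varphi$, while being elongated in the orthogonal direction.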

In the specific case when the covariance matrix is proportional to a unit matrix $\Sigma = s I$, with
$s$ denoting the spatial scale parameter, these directional derivatives correspond to regular Gaussian
derivatives as proposed as a model for spatial receptive fields by Koenderink and van Doorn (1987)
and Koenderink and van Doorn (1992). The use of non-isotropic covariance matrices does on the other
hand allow for a higher degree of orientation selectivity. Moreover, by having a family of
affine-adapted kernels tuned to a family of covariance matrices with different orientations and different
ratios between the scale parameters in the two directions, the family as a whole can represent
affine covariance, which makes it possible to perfectly match corresponding receptive field responses
between different views obtained under variations of the viewing direction in relation to the object.

Figure 10 shows illustrations of affine receptive fields of different orientations and degrees of
elongation as they arise if we assume that the set of all 3-D objects in the world has an approximately
uniform distribution of surface orientations in 3-D space and if we furthermore assume that
we observe these objects from a uniform distribution of viewing directions that are not directly
coupled to properties of the objects.

Figure 9: (left) Simple cells in the cerebral cortex do usually have strong directional preference in the spatial
domain. In terms of Gaussian derivatives, this spatial response can be modelled as a directional derivative
of an elongated affine Gaussian kernel, as reported by DeAngelis et al. (1995). (right) First-order directional
derivatives of anisotropic affine Gaussian kernels, here aligned to the coordinate directions $\partial_x g(x, y; \Sigma) =
\partial_x g(x, y; \lambda_x, \lambda_y) = -\frac{x}{\lambda_x} \frac{1}{2\pi\sqrt{\lambda_x \lambda_y}} \exp(-x^2/2\lambda_x - y^2/2\lambda_y)$ and here with $\lambda_x = 0.2$ deg$^2$ and $\lambda_y =
2$ deg$^2$, can be used as a model for simple cells with a strong directional preference.



Figure 10: Affine Gaussian receptive fields generated for a set of covariance matrices $\Sigma$ that correspond to
an approximately uniform distribution on a hemisphere in the 3-D environment, which is then projected onto
a 2-D image plane. (left) Zero-order receptive fields. (right) First-order receptive fields.



This idealized model of elongated receptive fields can also be extended to recurrent intracortical
feedback mechanisms as formulated by Somers et al. (1995) and Sompolinsky and Shapley (1997)
by starting from the equivalent formulation in terms of the non-isotropic diffusion equation (21)

$$\partial_s L = \frac{1}{2} \nabla_x^T (\Sigma_0 \nabla_x L) \tag{58}$$

with the covariance matrix $\Sigma_0$ locally adapted (see footnote 5) to the statistics of image data in a neighbourhood of
each image point; see Weickert (1998) and Almansa and Lindeberg (2000) for applications of this
idea to the enhancement of local directional image structures in computer vision.



5 By the use of locally adapted feedback, the resulting evolution equation does not obey the original linearity and
shift-invariance (homogeneity) requirements used for deriving the idealized affine Gaussian receptive field model, if the
covariance matrices $\Sigma_0$ are determined from properties of the image data that are determined in a non-linear way. For
a fixed set of covariance matrices $\Sigma_0$ at any image point, the evolution equation will still be linear and will specifically
obey non-enhancement of local extrema. In this respect, the resulting model could be regarded as a simplest form of
non-linear extension of the idealized receptive field model.

Relations to modelling by Gabor functions: Gabor functions have been frequently used for
modelling spatial receptive fields (Marcelja 1980, Jones and Palmer 1987a,b), motivated by their
property of minimizing the uncertainty relation. This motivation can, however, be questioned on
both theoretical and empirical grounds. Stork and Wilson (1990) argue that (i) only complex-valued
Gabor functions, which cannot describe a single receptive field, minimize the uncertainty relation, (ii) the
real functions that minimize this relation are Gaussian derivatives rather than Gabor functions and
(iii) comparisons among Gabor and alternative fits to both psychophysical and physiological data
have shown that in many cases other functions (including Gaussian derivatives) provide better fits
than Gabor functions do.

Conceptually, the ripples of the Gabor functions, which are given by complex sine waves, are
related to the ripples of Gaussian derivatives, which are given by Hermite functions. A Gabor
function, however, requires the specification of a scale parameter and a frequency, whereas a Gaussian
derivative requires a scale parameter and the order of differentiation. With the Gaussian derivative
model, receptive fields of different orders can be mutually related by derivative operations, and be
computed from each other by nearest-neighbour operations. The zero-order receptive fields as well
as the derivative-based receptive fields can be modelled by diffusion equations, and can therefore
be implemented by computations between neighbouring computational units.

In relation to invariance properties, the family of affine Gaussian kernels is closed under affine
image deformations, whereas the family of Gabor functions obtained by multiplying rotationally
symmetric Gaussians with sine and cosine waves is not closed under affine image deformations.
This means that it is not possible to compute truly affine invariant image representations from
such Gabor functions. Instead, given a pair of images that are related by a non-uniform image
deformation, the lack of affine covariance implies that there will be a systematic bias in image
representations derived from such Gabor functions, corresponding to the difference between the
backprojected Gabor functions in the two image domains. If using receptive field profiles defined from
directional derivatives of affine Gaussian kernels, it will on the other hand be possible to compute
affine invariant image representations.

In this respect, the Gaussian derivative model can be regarded as simpler: it can be related to
image measurements by differential geometry, be derived axiomatically from symmetry principles,
be computed from a minimal set of connections and allows for provable invariance properties under
non-uniform (affine) image deformations. Young (1987) has more generally shown how spatial
receptive fields in cats and monkeys can be well modelled by Gaussian derivatives up to order four.

4.2.2 Spatio-temporal dependencies

To model spatio-temporal receptive fields in the joint space-time domain, we can then state
scale-space models of simple cells in V1 using either

• non-causal Gaussian spatio-temporal derivative kernels

\[ h_{Gaussian}(x_1, x_2, t;\; s, \tau, v, \delta) = (\partial_{x_1^{\alpha_1} x_2^{\alpha_2} t^{\beta}} \, g)(x_1, x_2, t;\; s, \tau, v, \delta) \tag{59} \]

• time-causal spatio-temporal derivative kernels

\[ h_{time\text{-}causal}(x_1, x_2, t;\; s, \tau, v) = (\partial_{x_1^{\alpha_1} x_2^{\alpha_2} t^{\beta}} \, h)(x_1, x_2, t;\; s, \tau, v) \tag{60} \]

with the non-causal Gaussian spatio-temporal kernels $g(x_1, x_2, t;\; s, \tau, v, \delta)$ according to (42),
the time-causal spatio-temporal kernels $h(x_1, x_2, t;\; s, \tau, v)$ according to (54) and spatio-temporal
derivatives $\partial_{x_1^{\alpha_1} x_2^{\alpha_2} t^{\beta}}$ or velocity-adapted derivatives
$\partial_{\bar{x}_1^{\alpha_1} \bar{x}_2^{\alpha_2} \bar{t}^{\beta}}$ of these according to (46) and

Figure 11: (top row) Examples of non-separable spatio-temporal receptive field profiles in striate cortex
as reported by DeAngelis et al. (1995): (top left) a receptive field reminiscent of a second-order derivative in
tilted space-time (compare with the left column in figure 11), (top right) a receptive field reminiscent of a
third-order derivative in tilted space-time (compare with the right column in figure 11). (middle and bottom
rows) Non-separable spatio-temporal receptive fields obtained by applying velocity-adapted second- and
third-order derivative operations in space-time to spatio-temporal smoothing kernels generated by the
spatio-temporal scale-space concept: (middle left) Gaussian spatio-temporal kernel $g_{xx}(x, t;\; s, \tau, v, \delta)$
with $s = 0.5$ deg$^2$, $\tau = 50^2$ ms$^2$, $v = 0.006$ deg/ms, $\delta = 100$ ms. (middle right) Gaussian spatio-temporal
kernel $g_{xxx}(x, t;\; s, \tau, v, \delta)$ with $s = 0.5$ deg$^2$, $\tau = 60^2$ ms$^2$, $v = 0.006$ deg/ms, $\delta = 130$ ms.
(lower left) Time-causal spatio-temporal kernel $h_{xx}(x, t;\; s, \tau, v)$ with $s = 0.4$ deg$^2$, $\tau = 15$ ms$^{1/2}$,
$v = 0.006$ deg/ms. (lower right) Time-causal spatio-temporal kernel $h_{xxx}(x, t;\; s, \tau, v)$ with $s = 0.4$ deg$^2$,
$\tau = 15$ ms$^{1/2}$, $v = 0.006$ deg/ms. (Horizontal dimension: space x. Vertical dimension: time t.)

For a general orientation of receptive fields with respect to the spatial coordinate systems, the
receptive fields in these scale-space models can be jointly described in the form

\[
h_{simple\text{-}cell}(x_1, x_2, t;\; s, \tau, v, \Sigma) =
(\cos\varphi \, \partial_{x_1} + \sin\varphi \, \partial_{x_2})^{\alpha_1}
(\sin\varphi \, \partial_{x_1} - \cos\varphi \, \partial_{x_2})^{\alpha_2}
(v_1 \partial_{x_1} + v_2 \partial_{x_2} + \partial_t)^{n} \,
g(x_1 - v_1 t, x_2 - v_2 t;\; s \Sigma) \, h(t;\; \tau) \tag{61}
\]



where

• $\partial_{\varphi} = \cos\varphi \, \partial_{x_1} + \sin\varphi \, \partial_{x_2}$ and
$\partial_{\perp\varphi} = \sin\varphi \, \partial_{x_1} - \cos\varphi \, \partial_{x_2}$ denote spatial directional
derivative operators according to (32) in two orthogonal directions $\varphi$ and $\perp\varphi$,

• $\alpha_1 \geq 0$ and $\alpha_2 \geq 0$ denote the orders of differentiation in the two orthogonal directions in
the spatial domain with the overall spatial order of differentiation $m = \alpha_1 + \alpha_2$,

• $v_1 \partial_{x_1} + v_2 \partial_{x_2} + \partial_t$ denotes a velocity-adapted temporal derivative operator,

• $v = (v_1, v_2)$ denotes the image velocity,

• $n$ denotes the order of temporal differentiation,

• $g(x_1 - v_1 t, x_2 - v_2 t;\; s \Sigma)$ denotes a spatial affine Gaussian kernel according to (27) that
translates with image velocity $v = (v_1, v_2)$ in space-time,

• $\Sigma$ denotes a spatial covariance matrix that can be parameterized by two eigenvalues $\lambda_1$ and
$\lambda_2$ as well as an orientation $\theta$ of the form (31),

• $h(t;\; \tau)$ is a temporal smoothing kernel over time corresponding to the time-causal smoothing
kernel $\phi(t;\; \tau) = \frac{1}{\sqrt{2\pi} \, t^{3/2}} \, \tau \, e^{-\tau^2/(2t)}$ in (54) or a non-causal time-shifted Gaussian kernel
$g(t;\; \tau, \delta) = \frac{1}{\sqrt{2\pi\tau}} \, e^{-(t - \delta)^2/(2\tau)}$ according to (42),

• $s$ denotes the spatial scale and

• $\tau$ denotes the temporal scale.
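
To make the structure of (61) concrete, the following sketch evaluates a special case over one spatial dimension: a second-order spatial derivative ($\alpha_1 = 2$, $n = 0$) of a Gaussian that translates with velocity $v$, multiplied by the non-causal time-shifted Gaussian temporal kernel. The restriction to 1+1-D, the sampling grid and all function names are assumptions made for illustration; the parameter values are those quoted for the middle left panel of figure 11.

```python
import math

def gaussian(u, variance):
    """One-dimensional Gaussian kernel with the given variance."""
    return math.exp(-u * u / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def gaussian_xx(u, variance):
    """Second-order derivative of the one-dimensional Gaussian with respect to u."""
    return (u * u / variance ** 2 - 1.0 / variance) * gaussian(u, variance)

def velocity_adapted_gxx(x, t, s, tau, v, delta):
    """Second-order spatial derivative of a Gaussian translating with velocity v,
    multiplied by a Gaussian temporal kernel with time delay delta (1+1-D case)."""
    return gaussian_xx(x - v * t, s) * gaussian(t - delta, tau)

# Parameter values from the caption of figure 11 (middle left panel).
s, tau, v, delta = 0.5, 50.0 ** 2, 0.006, 100.0    # deg^2, ms^2, deg/ms, ms

xs = [-3.0 + 0.01 * i for i in range(601)]          # space in degrees
ts = [10.0 * j for j in range(31)]                  # time in ms, from 0 to 300
profile = [[velocity_adapted_gxx(x, t, s, tau, v, delta) for x in xs] for t in ts]

# Position of the strongest negative lobe in each time slice.
trough_positions = [xs[min(range(len(xs)), key=row.__getitem__)] for row in profile]
```

In this sketch the central lobe follows the line $x = v t$ in space-time, which is the tilt that makes the profile non-separable.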



Figure 11 shows examples of non-separable spatio-temporal receptive fields measured by cell recordings
in V1 with corresponding velocity-adapted spatio-temporal receptive fields obtained using the
Gaussian scale-space and the time-causal scale-space; see also Young et al. (2001) and Young and
Lesperance (2001) for a closely related approach based on Gaussian spatio-temporal derivatives
although using a different type of parameterization and (Lindeberg 1997) for closely related earlier
work. These scale-space models should be regarded as idealized functional and phenomenological
models of receptive fields that predict how computations occur in a visual system and whose actual
realization can then be implemented in different ways depending on available hardware or wetware.

Work has also been performed on learning receptive field properties and visual models from
the statistics of natural image data (Field 1987, van der Schaaf and van Hateren 1996, Olshausen
and Field 1996, Rao and Ballard 1998, Simoncelli and Olshausen 2001, Geisler 2008) and has been
shown to lead to the formation of similar receptive fields as found in biological vision. The proposed
theoretical model on the other hand makes it possible to determine such receptive fields from theo- 
retical first principles that reflect symmetry properties of the environment and thus without need for 
any explicit training stage or selection of representative image data. This normative approach can 
therefore be seen as describing the solution that an idealized learning based system may converge 
to, if exposed to a sufficiently large and representative set of natural image data. 

An interesting observation that can be made from the similarities between the receptive field
families derived by necessity from the assumptions and receptive field profiles found by cell recordings
in biological vision, is that receptive fields in the retina, LGN and V1 of higher mammals are
very close to ideal in view of the stated structural requirements/symmetry properties (Lindeberg
2012). In this sense, biological vision can be seen as having adapted very well to the transformation
properties of the outside world and the transformations that occur when a three-dimensional world
is projected to a two-dimensional image domain.

5 Mechanisms for obtaining true geometric invariances 

An important property of the above mentioned families of spatial and spatio-temporal receptive 
fields is that they obey basic covariance properties under 

• rescalings of the spatial and temporal dimensions, 

• affine transformations of the spatial domain and 

• Galilean transformations of space-time; 



see Lindeberg (2011, section 5.1.2, page 56) for more precise statements and explicit equations.
These properties do in turn allow the vision system to handle: 

• image data acquired with different spatial and temporal sampling rates, including image data 
that are sampled with different spatial resolution on a foveated sensor with decreasing sam- 
pling rate towards the periphery and spatio-temporal events that occur at different speed (fast 
vs. slow), 

• image structures of different spatial and/or temporal extent, including objects of different size 
in the world and events with longer or shorter duration over time, 

• objects at different distances from the camera, 

• the linear component of perspective deformations (e.g. perspective foreshortening) and

• the linear component of relative motions between objects in the world and the observer. 

In these respects, the presented receptive field models ensure that visual representations will be 
well-behaved under basic geometric transformations in the image formation process. 

This framework can then in turn be used as a basis for defining truly invariant representations. 
In the following, we shall describe basic approaches for this that have been developed in the area 
of computer vision, and have been demonstrated to be powerful mechanisms for achieving scale 
invariance, affine invariance and Galilean invariance for real-world data. Since these mechanisms 
are expressed at a functional level of receptive fields, we propose that corresponding mechanisms
can be applied to neural models and can provide a mathematically well-founded framework for
explaining invariance properties in computational models.

5.1 Scale invariance 

Given a set of receptive fields that operate over some range of scale, a general approach for obtaining
scale invariance is by performing scale selection from local extrema over scale of scale-normalized
derivatives (Lindeberg 1994a, 1998)

\[ \partial_{\xi_1} = s^{\gamma/2} \, \partial_{x_1}, \quad \partial_{\xi_2} = s^{\gamma/2} \, \partial_{x_2} \tag{62} \]

where $\gamma \in [0, 1]$ is a free parameter that can be adjusted to the task and in some cases can be chosen
as $\gamma = 1$. Specifically, it can be shown that if a scale-normalized derivative response computed from
a spatial image $f(x)$ has a local extremum over scale at scale $s_0$ for some position $x_0$ in image space,
then if we define a rescaled image $f'(x')$ by $f'(x') = f(x)$ where $x' = a x$ for some scaling factor $a$,
then there will be a corresponding local extremum over scale in the rescaled image $f'(x')$ at scale
$s' = a^2 s_0$ and position $x'_0 = a x_0$ (Lindeberg 1994a, section 13.2.1; Lindeberg 1998, section 4.1).
In other words, local extrema over scale of scale-normalized derivatives are preserved under scaling
transformations and follow the scale variations in an appropriate manner. This property also extends
to linear and non-linear combinations of receptive field responses that correspond to spatial and
spatio-temporal derivatives of the Gaussian spatial and spatio-temporal scale-space concepts described
in section 2 as well as to the idealized models of biological receptive fields presented in section 4.
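
The scaling property can be checked on a simple special case. The sketch below assumes a unit-height Gaussian blob $f(x) = e^{-|x|^2/(2 s_0)}$ in two dimensions, for which the Laplacian built from the scale-normalized derivatives (62) has the closed form $-2 s_0 s^{\gamma}/(s_0 + s)^2$ at the blob centre, since Gaussian smoothing adds variances. The blob, the logarithmic grid of scale levels and the function names are assumptions for illustration and not part of the original treatment.

```python
import math

def normalized_laplacian_at_centre(s, s0, gamma=1.0):
    """Scale-normalized Laplacian s^gamma (L_x1x1 + L_x2x2) at the centre of the
    unit-height Gaussian blob f(x) = exp(-|x|^2 / (2 s0)), evaluated at scale s."""
    return -(s ** gamma) * 2.0 * s0 / (s0 + s) ** 2

def select_scale(s0, gamma=1.0, n_levels=4001, log_min=-6.0, log_max=10.0):
    """Return (scale, response) at the extremum over a logarithmic grid of scales."""
    best_s, best_r = None, 0.0
    for k in range(n_levels):
        s = math.exp(log_min + (log_max - log_min) * k / (n_levels - 1))
        r = normalized_laplacian_at_centre(s, s0, gamma)
        if abs(r) > abs(best_r):
            best_s, best_r = s, r
    return best_s, best_r

s0 = 4.0    # variance of the original blob
a = 3.0     # spatial scaling factor, x' = a x

s_hat, r_hat = select_scale(s0)
# The rescaled image f'(x') = f(x) is a unit-height blob with variance a^2 s0.
s_hat_rescaled, r_hat_rescaled = select_scale(a ** 2 * s0)

scale_ratio = s_hat_rescaled / s_hat    # close to a^2, up to the grid resolution
```

With $\gamma = 1$ the selected scales differ by a factor close to $a^2$, while the magnitude of the response at the selected scale is the same for both blobs, which is the behaviour that the scale selection mechanism relies on.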




Figure 12: Illustration of how scale selection can be performed from receptive field responses by computing 
scale-normalized Gaussian derivative operators at different scales and then detecting local extrema over scale. 
Here, so-called scale-space signatures have been computed at the centers of two different lamps at different 
distances to the observer. 



Figure 12 illustrates this idea by performing local scale selection at two different points in a
spatial image from local extrema over scale of the scale-normalized Laplacian $\nabla^2_{norm} L$ and the
scale-normalized determinant of the Hessian $\det \mathcal{H}_{norm} L$ computed from Gaussian-derivative
receptive fields at different spatial scales

\[ \nabla^2_{norm} L = s \, (L_{x_1 x_1} + L_{x_2 x_2}), \quad \det \mathcal{H}_{norm} L = s^2 \, (L_{x_1 x_1} L_{x_2 x_2} - L_{x_1 x_2}^2). \tag{63} \]



From the graphs in this figure, which show the variation over scale of the scale-normalized Laplacian
$\nabla^2_{norm} L$ and the scale-normalized determinant of the Hessian $\det \mathcal{H}_{norm} L$ as function of
effective scale $\log s$, it can be seen that the local extrema over scale are assumed at a finer scale for
the distant object and at a coarser scale for the nearby object. The ratio between these scale values
measured in units of the standard deviation $\sigma = \sqrt{s}$ of the underlying Gaussian kernels corresponds
to the ratio between the sizes of the projections of the two lamps in the image domain and
reflects the ratio between the distances between these objects and the observer.

Computing image descriptors at scale levels obtained from a scale selection step based on local
extrema over scale of scale-normalized receptive field responses, or equivalently computing
image descriptors from local image patches that have been scale normalized with respect to such
size estimates, constitutes a very general approach for obtaining true scale invariance and has been
successfully applied for different tasks in computer vision (Lindeberg 1999, 2008) including scale
invariant tracking and object recognition (Bretzner and Lindeberg 1998, Lowe 2004, Bay et al.
2008) and estimation of time to collision from temporal size variations in the image domain
(Lindeberg and Bretzner 2003, Negre et al. 2008).

Figure 13: Illustration of how scale normalization can be performed by rescaling local image structures using
scale information obtained from a scale selection mechanism. Here, the two windows selected in figure 12
have been transformed to a common scale-invariant reference frame by normalizing them with respect to the
scale levels at which the scale-normalized Laplacian and the scale-normalized determinant of the Hessian
respectively assumed their global extrema over scale. Note the similarities of the resulting scale normalized
representations, although they correspond to physically different objects in the world.



Figure 13 illustrates an application of the latter scale normalization approach applied to the two
windows in figure 12 by first detecting local extrema over scale of the scale-normalized Laplacian
$\nabla^2_{norm} L$ and the scale-normalized determinant of the Hessian $\det \mathcal{H}_{norm} L$ and then using these
scale values for rescaling the two windows to a common reference frame. In theory any image
measurement derived from the common reference frame will be truly scale invariant. Scale selection
performed in this way does hence constitute a very general principle for achieving scale invariance
for image measurements in terms of receptive fields.

It should be emphasized, however, that there is in principle no need for carrying out the image
warping in practice as it has been done in figure 13 for the purpose of illustration. On a neural
architecture it may be more efficient to consider a routing mechanism (Olshausen et al. 1993, Wiskott
2004) that operates on image representations at different scales and selects visual representations
from the scales at which image features assume their extremum responses over scale. In this respect,
the resulting model will be qualitatively rather similar to the approach by Riesenhuber and Poggio
(1999), where a SoftMax operation (a soft winner-take-all mechanism) is applied for computing
receptive field representations at successively higher layers in a hierarchical architecture.
Specifically, the notion of scale-normalized derivatives according to (62) determines how the receptive
field responses as modelled by Gaussian derivatives should be normalized between different scale
levels in such a model (see footnote 6). Due to the scale covariant nature of the underlying receptive
fields, it follows that the visual representations that are routed forward by the maximum selection
mechanism will be truly scale invariant. Concerning the possible biological implementation of such a
maximum operation, Gawne and Martin (1993) have shown that there are neurons in area V4 of monkey
whose responses to two simultaneously presented stimuli are well predicted by the maximum of the
responses to each stimulus presented separately.



6 In practice, the scale normalization in equation (63) with $\gamma = 1$ corresponds to normalizing the underlying Gaussian
derivative receptive fields (23) to constant $L_1$-norm over scale, whereas other values of $\gamma \neq 1$ correspond to other
$L_p$-norms being constant over scale (Lindeberg 1998, section 9.1, pages 107-108).

5.2 Affine invariance



Given a set of spatial receptive fields as generated from affine Gaussian kernels (27) with their
directional derivatives (32) for different spatial extents and orientations as specified by different
covariance matrices (31), the vision system will be faced with the task of interpreting the output from
the corresponding family of receptive field responses. For example, if we assume that the vision
system observes a local surface patch in the world, one may ask if some specific selection of filter
parameters would be particularly suitable for interpreting the data in any given situation.
Specifically, if the vision system observes the same surface patch from two different viewing directions,
it would be valuable if the vision system could maintain a stable perception of the surface patch
although it will be deformed in different ways in the two perspective projections onto the different
image planes.

One way of selecting filter responses from such a family of affine receptive fields is by using
image measurements in terms of the second-moment matrix (structure tensor)

\[ \mu(x;\; s_{der}, s_{int}) = \int_{u \in \mathbb{R}^2} (\nabla L)(u;\; s_{der}) \, (\nabla L)(u;\; s_{der})^T \, g(x - u;\; s_{int}) \, du \tag{64} \]

where $s_{der}$ is a local scale parameter describing the scale at which spatial derivatives are computed
and $s_{int}$ is a second integration scale parameter over which local statistics of spatial derivatives
are accumulated (Lindeberg 1994a, Lindeberg and Garding 1997). These statistics correspond to
weighted averages of the non-linear combinations of partial derivatives $L_{x_1}^2$, $L_{x_1} L_{x_2}$ and $L_{x_2}^2$ using
a Gaussian function as weight, and could on a biological architecture be performed in a visual area
that is based on input from V1, for example in V2.

A useful property of the second-moment matrix is that it transforms in a suitable way under
affine transformations, as will be described next. If we consider an affine extension of the
second-moment matrix by replacing the scalar scale parameters $s_{der}$ and $s_{int}$ in (64) by corresponding
covariance matrices $\Sigma_{der}$ and $\Sigma_{int}$

\[ \mu(x;\; \Sigma_{der}, \Sigma_{int}) = \int_{u \in \mathbb{R}^2} (\nabla L)(u;\; \Sigma_{der}) \, (\nabla L)(u;\; \Sigma_{der})^T \, g(x - u;\; \Sigma_{int}) \, du \tag{65} \]

and consider two images $f(x)$ and $f'(x')$ that are related by an affine transformation $x' = A x$
such that $f'(A x) = f(x)$, then the corresponding affine second-moment matrices will be related
according to

\[ \mu'(A x;\; A \Sigma_{der} A^T, A \Sigma_{int} A^T) = A^{-T} \, \mu(x;\; \Sigma_{der}, \Sigma_{int}) \, A^{-1}. \tag{66} \]

Specifically, if we can determine covariance matrices $\Sigma_{der}$ and $\Sigma_{int}$ such that
$\mu(x;\; \Sigma_{der}, \Sigma_{int}) = c_1 \Sigma_{der}^{-1} = c_2 \Sigma_{int}^{-1}$ for some constants $c_1$ and $c_2$, we obtain a
fixed-point that will be preserved under affine transformations (Lindeberg 1994a, Lindeberg and
Garding 1997). This property can be used for signalling if the image measurements that have been
performed for a particular setting of filter parameters in a family of affine Gaussian receptive fields
satisfy the fixed-point requirement. If so, they can be used for defining an affine invariant reference
frame (see footnote 7) by transforming the local image patch with a linear transformation proportional
to $A = \mu^{1/2}$.

7 If the local image pattern is weakly isotropic in the sense that the second-moment matrix computed in the tangent plane of
the surface is proportional to the unit matrix, $\mu_{surf} = c I$ for some constant $c$ (Garding and Lindeberg 1996), then the
foreshortening caused by the perspective mapping will be compensated for by the affine transformation given by
(66). For non-isotropic image patterns with $\mu_{surf} \neq c I$ this interpretation no longer holds, but the affine transformed
surface pattern will still be affine invariant.

It should be noted, however, that the affine transformation A is not uniquely determined by the
fixed-point requirement (65), which only determines two of the four parameters, corresponding to
amount and direction of perspective foreshortening of a local surface pattern, in other words the
viewing direction in relation to an object centered coordinate system. The two remaining degrees
of freedom correspond to (i) an overall scaling factor corresponding to the viewing distance, which
can be determined by scale selection as described in section 5.1 and (ii) a free rotation angle,
corresponding to the selection of a representative direction in the image plane. If the vertical direction
is preserved under the perspective transformation, we may therefore not need to determine it or just
adjust it from an initial estimate.
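
The algebra behind this normalization can be illustrated with plain 2x2 matrices. The sketch below assumes a hypothetical second-moment matrix $\mu$ and a hypothetical affine deformation $A$, applies the transformation property (66) while ignoring the dependence on the scale parameters, and checks that warping with $\mu^{1/2}$ in each image domain leaves only an orthogonal transformation between the two normalized frames. It is a check of the matrix identities only, not an implementation of shape adaptation on image data, and all names and numerical values are assumptions.

```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(a):
    return [[a[0][0], a[1][0]], [a[0][1], a[1][1]]]

def inverse(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def sqrt_spd(m):
    """Principal square root of a symmetric positive definite 2x2 matrix (closed form)."""
    root_det = math.sqrt(m[0][0] * m[1][1] - m[0][1] * m[1][0])
    norm = math.sqrt(m[0][0] + m[1][1] + 2.0 * root_det)
    return [[(m[0][0] + root_det) / norm, m[0][1] / norm],
            [m[1][0] / norm, (m[1][1] + root_det) / norm]]

# Hypothetical second-moment matrix in the first image and hypothetical
# affine deformation x' = A x between the two views.
mu = [[5.0, 1.5], [1.5, 2.0]]
A = [[1.4, 0.3], [-0.2, 0.7]]

# Transformation property (66): mu' = A^{-T} mu A^{-1}.
A_inv = inverse(A)
mu_prime = matmul(matmul(transpose(A_inv), mu), A_inv)

def normalized_moment(m):
    """Second-moment matrix after warping with B = m^{1/2}, i.e. B^{-T} m B^{-1}."""
    b_inv = inverse(sqrt_spd(m))
    return matmul(matmul(transpose(b_inv), m), b_inv)

unit_1 = normalized_moment(mu)          # close to the unit matrix
unit_2 = normalized_moment(mu_prime)    # close to the unit matrix

# Residual transformation between the two normalized reference frames.
R = matmul(matmul(sqrt_spd(mu_prime), A), inverse(sqrt_spd(mu)))
RtR = matmul(transpose(R), R)           # close to the unit matrix, so R is orthogonal
```

Since $R^T R$ equals the unit matrix up to rounding errors, the two normalized frames differ only by an orthogonal transformation, which corresponds to the free rotation angle mentioned above (the overall scaling factor is fixed here by using $\mu^{1/2}$ itself rather than a matrix proportional to it).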

[Figure 14 panels, left to right: oblique view of an object; affine invariant reference frame; invariant receptive field response]

Figure 14: Illustration of how affine invariance can be achieved by normalization to an affine invariant
reference frame determined from a second-moment matrix. (left) A grey-level image with an oblique view
of a book cover. (middle) The result of affine normalization of a central image patch using a series of affine
transformations proportional to $A = \mu^{1/2}$ until an affine invariant fixed-point of (66) has been reached.
(right) An example of computing receptive field responses, here the Laplacian $\nabla^2 L$, in the affine normalized
reference frame. Receptive field responses computed in this reference frame will be affine invariant up to an
undetermined scaling factor and an undetermined rotation angle and therefore invariant with respect to the
viewing direction of the object in relation to an object centered coordinate system.



Figure 14 illustrates an application of this idea to an image with an oblique view of a book
cover. Here, a second-moment matrix $\mu$ has been computed over a central window in the original
image shown in the left figure. Then, an affine transformation matrix $A = \mu^{1/2}$ has been used
for warping the central window to a reference frame, and this so-called affine shape-adaptation
process has been repeated until the second-moment matrix in the reference frame is sufficiently
close to proportional to the unit matrix $\mu \approx c I$. The middle figure shows the result of warping
a central window in the original image to the resulting affine normalized reference frame. Due to
the affine invariant property of the fixed point (66), any receptive field response computed in this
reference frame will be affine invariant up to an undetermined scaling factor and a free rotation
angle. Hence, this method provides a way of normalizing receptive field responses with respect
to image transformations outside the similarity group that correspond to variations in the viewing
direction relative to the object.

Again it should, however, be emphasized that it is in principle not needed to perform the actual 
image transformation in reality to achieve the affine invariant property. On a neural architecture 
that computes a family of affine receptive fields with different orientations and spatial extents in 
parallel, one can again consider a routing mechanism that selects the receptive field responses from 
those receptive fields whose measurements of second-moment matrices are in best agreement with 
the underlying covariance matrices in relation to the fixed-point property. Then, up to a known 
transformation whose parameters can be computed from the corresponding second-moment matrix, 
these routed receptive field responses will also be affine invariant. 

In the area of computer vision, this idea of affine shape adaptation has been used for defining
affine invariant image descriptors with successful applications to image matching, recognition and
estimation of cues to surface shape (Lindeberg and Garding 1997, Baumberg 2000, Mikolajczyk
and Schmid 2004, Tuytelaars and van Gool 2004, Lazebnik et al. 2005, Rothganger et al. 2006).



On a neural architecture, one can also conceive that a neuron or a group of neurons that are
adapted to a particular shape of the covariance matrix corresponding to an orientation in space could
determine if the local image measurements that have been performed for this particular orientation
in space are in agreement with the fixed-point requirement (66). The neuron(s) could then respond
with a high activity if the local image measurements agree with the filter parameters to which the
receptive fields are tuned and with a low activity otherwise. Hence, this framework allows for the
formulation of affine invariant receptive field responses, to support view-invariant recognition at the
level of groups of oriented receptive fields over a set of different covariance matrices $\Sigma_k$.



5.3 Galilean invariance 

Given a family of spatio-temporal receptive fields that are adapted to motions of different image
velocities v and given an object that moves with some unknown image velocity u in relation to the
viewing direction, the vision system also faces the problem of how to interpret the output from the
family of receptive fields. Figure 15 shows an illustration of how receptive field responses may be
affected by relative motions between objects in the world and the observer.

If we knew the image velocity u of the object beforehand, it would of course be preferable
to select receptive field responses from the receptive fields that are adapted to precisely this image 
velocity v = u. A priori, we cannot, however, assume such knowledge, since one of the basic tasks 
in relation to object recognition may be to determine the image velocity of an unknown moving 
object. There are also classes of composed spatio-temporal events consisting of different image 
velocities at different positions x and time moments t in space-time, for which it may not be trivial 
how a representative image velocity could be defined for the spatio-temporal event as a whole. 
Hence, this problem warrants a principled treatment. 

Given spatio-temporal image data $f(x, t) = f(p)$ with a position in space-time denoted by $p = (x, t)^T$,
let us define a Gaussian spatio-temporal scale-space representation $L$ of $f$ by convolution
with a Gaussian spatio-temporal kernel $g(\cdot;\; \Sigma_{der}, \delta_{der})$ with spatio-temporal covariance matrix
$\Sigma_{der}$ of the form (43) and with time delay $\delta_{der}$. With a spatio-temporal second-moment matrix $\mu$
over 2+1-D space-time defined according to (Lindeberg 2011, equation (191), page 73)

\[ \mu(p;\; \Sigma_1, \Sigma_2, \delta_1 + \delta_2) = \int_q (\nabla L(q;\; \Sigma_1, \delta_1)) \, (\nabla L(q;\; \Sigma_1, \delta_1))^T \, g(p - q;\; \Sigma_2, \delta_2) \, dq, \tag{67} \]

where $g(p - q;\; \Sigma_2, \delta_2)$ denotes a second-stage Gaussian smoothing with covariance matrix $\Sigma_2$ and
time delay $\delta_2$ over space-time, it is indeed possible to perform such velocity selection. Consider two
Galilean-related spatio-temporal image data sets $f'(p') = f(p)$ that are related by a relative image
velocity $u - v$ such that $p' = G_{u-v} \, p$ for a Galilean transformation matrix $G_{u-v}$ according to
(35). Then, it can be shown that the corresponding spatio-temporal second-moment matrices are related
according to (Lindeberg 2011, equation (193), page 73)

\[ \mu' = G_{u-v}^{-T} \, \mu \, G_{u-v}^{-1}. \tag{68} \]

Let us introduce the notion of Galilean diagonalization, which corresponds to finding the unique
Galilean transformation that transforms the spatio-temporal second-moment matrix to block diagonal
form with all mixed purely spatio-temporal components being zero, $\mu'_{x_1 t} = \mu'_{x_2 t} = 0$ (Lindeberg
et al. 2004)

\[ \mu' = \begin{pmatrix} \mu'_{x_1 x_1} & \mu'_{x_1 x_2} & 0 \\ \mu'_{x_1 x_2} & \mu'_{x_2 x_2} & 0 \\ 0 & 0 & \mu'_{tt} \end{pmatrix}. \tag{69} \]

Such a block diagonalization can be obtained if the velocity vector u satisfies

\[ \begin{pmatrix} \mu_{x_1 x_1} & \mu_{x_1 x_2} \\ \mu_{x_1 x_2} & \mu_{x_2 x_2} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = - \begin{pmatrix} \mu_{x_1 t} \\ \mu_{x_2 t} \end{pmatrix} \tag{70} \]

with the solution

\[ u = -\{\mu_{xx}\}^{-1} \{\mu_{xt}\}, \tag{71} \]

i.e., structurally similar equations as are used for computing optic flow according to the method by
Lukas and Kanade (1981).

Figure 15: Illustration of how receptive field responses may be affected by unknown relative motions between
objects in the world and the observer and of how this effect can be handled by local velocity adaptation. The
first row shows space-time traces of a walking person taken with (left column) a stabilized camera with the
viewing direction following the motion of the person and (right column) a stationary camera with a fixed
viewing direction. The second row shows Laplacian receptive field responses computed in the two domains
from space-time separable receptive fields without velocity adaptation. In the third row, these receptive field
responses from the stationary camera have been space-time warped to the reference frame of the stabilized
camera. As can be seen from the data, the receptive field responses are quite different in the two domains,
which implies problems if one would try to match them. Hence, spatio-temporal recognition based on
space-time separable receptive fields only can be a rather difficult problem. In the fourth row, the receptive field
responses have instead been computed with local velocity adaptation, by computing extremum responses of
the Laplacian receptive field responses over different image velocities for each point in space-time. In the
fifth row, the velocity-adapted receptive field responses from the stationary camera have been space-time warped
to the reference frame of the stabilized camera. As can be seen from a comparison with the corresponding
result obtained for the non-adapted receptive field responses in the third row, the use of velocity adaptation
implies a better stability of receptive field responses under unknown relative motions between objects in the
world and the observer. (Adapted from (Laptev and Lindeberg 2004a).)

It can then be shown that the property of Galilean block diagonalization is preserved under Galilean
transformations (Lindeberg 2011, appendix C.4, pages 73-74). Specifically, the velocity vector
associated with the Galilean transformation, that brings a second-moment
matrix into block diagonal form, is additive under superimposed Galilean transformations. This is
a very general approach for normalizing local spatio-temporal image patterns, which also applies
to spatio-temporal patterns that cannot be modelled by a Galilean transformation of an otherwise
temporally stationary spatial pattern. Specifically, spatio-temporal receptive field responses that can
be expressed with respect to such a spatio-temporal reference frame will be Galilean invariant.
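
The solution $u = -\{\mu_{xx}\}^{-1}\{\mu_{xt}\}$ and its additivity can be illustrated on synthetic data. The sketch below assumes a pattern of the form $f(x, t) = f_0(x - v t)$, for which the temporal derivative equals $-v \cdot \nabla f_0$, and replaces the Gaussian-weighted averages of the second-moment matrix by uniform averages over a set of hypothetical gradient samples. The sample generator, the velocities and the function names are assumptions for illustration only.

```python
import math

def second_moment(gradients):
    """Uniform average of the outer products of (L_x1, L_x2, L_t) gradient samples."""
    n = len(gradients)
    return [[sum(g[i] * g[j] for g in gradients) / n for j in range(3)] for i in range(3)]

def diagonalizing_velocity(mu):
    """Solve the 2x2 system mu_xx u = -mu_xt for the velocity u."""
    a, b, c, d = mu[0][0], mu[0][1], mu[1][0], mu[1][1]
    det = a * d - b * c
    r1, r2 = -mu[0][2], -mu[1][2]
    return ((d * r1 - b * r2) / det, (-c * r1 + a * r2) / det)

def moving_pattern_gradients(velocity, n=200):
    """Gradient samples of f(x, t) = f0(x - v t) for a hypothetical pattern f0,
    using the relation L_t = -(v1 L_x1 + v2 L_x2)."""
    grads = []
    for k in range(n):
        gx1 = math.cos(0.7 * k) + 0.3 * math.sin(2.1 * k)
        gx2 = math.sin(1.3 * k) - 0.2 * math.cos(0.4 * k)
        gt = -(velocity[0] * gx1 + velocity[1] * gx2)
        grads.append((gx1, gx2, gt))
    return grads

v = (0.8, -0.3)     # image velocity of the pattern for a first observer
w = (0.5, 0.25)     # additional relative velocity for a second observer

u1 = diagonalizing_velocity(second_moment(moving_pattern_gradients(v)))
u2 = diagonalizing_velocity(
    second_moment(moving_pattern_gradients((v[0] + w[0], v[1] + w[1]))))
difference = (u2[0] - u1[0], u2[1] - u1[1])   # close to w
```

For this idealized pattern the estimated velocity coincides with the true image velocity, and superimposing an additional relative velocity shifts the estimate by the same amount, in line with the additivity property stated above.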

These ideas have been applied in computer vision for performing spatio-temporal recognition
under unknown relative motions between the spatio-temporal events and the observer (Laptev and
Lindeberg 2004a, Laptev et al. 2007). Notably the approach in (Laptev and Lindeberg 2004a)
is based on a set of spatio-temporal receptive fields over which simultaneous selection of image
velocities and spatio-temporal scales is performed.

Again, it is not necessary to carry out the spatio-temporal normalization in practice to achieve 
Galilean invariance. On a neural architecture based on a family of spatio-temporal receptive fields 
that operate over some set of image velocities in parallel, one may consider a routing mecha- 
nism that selects receptive field responses by judging the degree of agreement with the criterion 
of Galilean diagonalization (69) and then giving priority to the responses that are most consistent
with this criterion. Notably, such a computational mechanism will have the ability to respond to dif- 
ferent motions at different spatial and temporal scales and may therefore have the ability to handle 
transparent motion. 

Note that all the information that is needed for computing the spatio-temporal second-moment
matrix and the Galilean diagonalization consists of spatio-temporal averages of the non-linear
combinations $L_{x_1}^2$, $L_{x_1} L_{x_2}$, $L_{x_2}^2$, $L_{x_1} L_t$, $L_{x_2} L_t$ and $L_t^2$ of first-order spatio-temporal derivatives
and can hence be computed from spatio-temporal receptive fields. On a biological architecture, the
corresponding information could therefore be computed from the output of V1 neurons in combination
with an additional layer of spatio-temporal smoothing. Thus, a similar type of information could
in principle be computed by a visual motion area with direct access to the output from V1, such as
V5/MT.

On a neural architecture, one can also conceive that a neuron or a group of neurons that are 
adapted to a particular image velocity could determine if the local spatio-temporal image measure- 
ments that have been performed for this particular image velocity in space-time are in agreement 
with the fixed-point requirement ( [69] ) of Galilean diagonalization. If so, the neuron(s) could re- 
spond with a high activity if the local measurements agree with the filter parameters to which the 
receptive fields are tuned and with a low activity otherwise. Hence, this framework allows for the 
formulation of Galilean invariant neurons, to support invariant recognition of visual objects under 
unknown relative motions between the object and the observer, provided that this invariance property
is formulated at the level of groups of oriented receptive fields over a set of image velocities $v_k$.



6 Invariance property under illumination variations 

In the treatment so far, we have described how image measurements in terms of receptive fields are 
related to the geometry of space and space-time, under the assumption that the actual image intensi- 
ties from which the receptive field responses are to be computed have been given beforehand. One 
may, however, consider alternative ways of parameterizing the intensity domain by monotonic
intensity transformations that preserve the ordering between the image intensities, and which in this
respect contain essentially equivalent information.

Given the huge range of luminosity variations under natural imaging conditions (corresponding
to a range of the order of $10^{10}$ between the darkest and brightest cases for human vision), it is
natural to represent the image luminosities on a logarithmic luminosity scale

\[ f(x) \sim \log I(x) \quad \text{(time-independent images)}, \]
\[ f(x, t) \sim \log I(x, t) \quad \text{(spatio-temporal image data)}. \]

Specifically, receptive field responses that are computed from such a logarithmic parameterization
of the image luminosities can be interpreted physically as a superposition of relative variations of
surface structure and illumination variations. Given (i) a perspective camera model extended with
(ii) a thin circular lens for gathering incoming light from different directions and (iii) a Lambertian
illumination model extended with (iv) a spatially varying albedo factor for modelling the light that is
reflected from surface patterns in the world, it can be shown (Lindeberg, 2012) that a spatial receptive
field response

\[ L_{x^{\alpha}}(\cdot;\, s) = L_{x_1^{\alpha_1} x_2^{\alpha_2}}(\cdot, \cdot;\, s) = \partial_{x^{\alpha}} \mathcal{T}_s f \tag{73} \]

of the image data $f$, where $\mathcal{T}_s$ represents the spatial smoothing operator (here corresponding to a
two-dimensional Gaussian kernel (22)), can be expressed as

\[ L_{x^{\alpha}} = \partial_{x^{\alpha}} \mathcal{T}_s \left( \log \rho(x) + \log i(x) + \log C_{cam}(\tilde{f}) + V(x) \right) \tag{74} \]

where



(i) $\rho(x)$ is a spatially dependent albedo factor that reflects properties of surfaces of objects in
the environment, with the implicit understanding that this entity may in general refer to points
on different surfaces in the world depending on the viewing direction and thus the image
position $x = (x_1, x_2)$,

(ii) $i(x)$ denotes a spatially dependent illumination field, with the implicit understanding that the
amount of incoming light on different surfaces may be different for different points in the
world as mapped to corresponding image coordinates $x$,

(iii) $C_{cam}(\tilde{f})$ represents internal camera parameters, with the ratio $\tilde{f} = f/d$ referred to as
the effective f-number, where $d$ denotes the diameter of the lens and $f$ the focal distance,

(iv) $V(x) = V(x_1, x_2) = -2 \log(1 + x_1^2 + x_2^2)$ represents a geometric natural vignetting effect
corresponding to the factor $\log \cos^4(\phi)$ for a planar image plane, with $\phi$ denoting the angle
between the viewing direction $(x_1, x_2, f)$ and the surface normal $(0, 0, 1)$ of the image plane.
(This vignetting term will disappear for a spherical camera model.)
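
The correspondence between the vignetting term and the cosine-fourth factor in item (iv) can be checked in a few lines. The derivation below is a sketch that assumes image coordinates expressed in units of the focal distance, so that $f = 1$. This normalization is implied by the stated form of $V$ but is not spelled out in this section.

```latex
\begin{align}
\cos\phi &= \frac{(x_1, x_2, 1) \cdot (0, 0, 1)}{\|(x_1, x_2, 1)\|}
          = \frac{1}{\sqrt{1 + x_1^2 + x_2^2}}, \\
\log\cos^4\phi &= 4 \log\left( (1 + x_1^2 + x_2^2)^{-1/2} \right)
                = -2 \log(1 + x_1^2 + x_2^2) = V(x_1, x_2).
\end{align}
```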



From the structure of equation (74) we can note that for any non-zero order of differentiation
$\alpha > 0$, the influence of the internal camera parameters in $C_{cam}(\tilde{f})$ will disappear because of
the spatial differentiation with respect to $x$, and so will the effects of any other multiplicative
exposure control mechanism (see footnote 8). Furthermore, for any multiplicative illumination
variation $i'(x) = C \, i(x)$, where $C$ is a scalar constant, the logarithmic luminosity will be
transformed as $\log i'(x) = \log C + \log i(x)$, which implies that the dependency on $C$ in (72) will
disappear after spatial differentiation. Thus, receptive field responses in terms of spatial
derivatives are invariant under multiplicative illumination variations.
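
The cancellation of a multiplicative illumination factor can be illustrated numerically. The following is a minimal sketch, assuming a one-dimensional luminosity profile and plain central differences in place of Gaussian derivative receptive fields. The profile `I`, the gain `C` and the helper names are hypothetical and chosen only for illustration.

```python
import math

def log_derivative(intensity, h=1.0):
    """Central differences of f = log I, a crude stand-in for a first-order receptive field."""
    f = [math.log(v) for v in intensity]
    return [(f[k + 1] - f[k - 1]) / (2 * h) for k in range(1, len(f) - 1)]

def linear_derivative(intensity, h=1.0):
    """The same difference operator applied directly to I, for comparison."""
    return [(intensity[k + 1] - intensity[k - 1]) / (2 * h)
            for k in range(1, len(intensity) - 1)]

# Hypothetical strictly positive luminosity profile and a global illumination gain C.
I = [10.0 + 4.0 * math.sin(0.3 * k) for k in range(50)]
C = 250.0
I_scaled = [C * v for v in I]

# Largest change in the derivative responses when the illumination is multiplied by C.
max_diff_log = max(abs(a - b)
                   for a, b in zip(log_derivative(I), log_derivative(I_scaled)))
max_diff_lin = max(abs(a - b)
                   for a, b in zip(linear_derivative(I), linear_derivative(I_scaled)))
print(max_diff_log, max_diff_lin)
```

On the logarithmic scale the two derivative signals agree up to floating-point rounding, whereas on the linear scale the responses are multiplied by the gain.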

Specifically, if the illumination field $i(x)$ is constant over the support region of the receptive
field, the receptive field response will then, up to the variations in the natural vignetting $V(x)$, only
respond to the spatial variations of the albedo factor $\rho(x)$, i.e., only to variations in the surface
pattern(s) in the world. Hence, the receptive field responses will have a direct physical interpretation in
terms of properties of objects and events in the environment. This result can be seen as a theoretical
explanation of why recognition methods based on receptive field responses work so well in the area
of computer vision. More generally, this result could also be seen as a theoretical explanation of
how the receptive field responses that are computed in LGN and V1 can constitute the foundation
for the visual operations in higher visual areas in biological vision.

8 For biological vision, such multiplicative exposure control mechanisms correspond to adaptations of the luminosity
on the retina by varying the diameter of the pupil as well as adaptations of the light sensitivity of the photoreceptors to
the luminosity.

Figure 16: Illustration of the effect of computing Laplacian receptive field responses $\nabla^2 L$ from image
intensities defined on (left column) a linear intensity scale $f(x) \sim I(x)$ vs. (right column) a logarithmic intensity
scale $f(x) \sim \log I(x)$ for an image with substantial illumination variations. As can be seen from the figure,
the magnitudes of the Laplacian receptive field responses are substantially higher in the left sunlit part of the
house compared to the right part in the shade if the Laplacian responses are computed from a linear
luminosity scale, whereas the difference in amplitude between the left and the right parts of the house becomes
substantially lower if the receptive field responses are computed from a logarithmic intensity scale.

Notably, the vignetting effect $V(x)$ is independent of the image contents $f$ and could therefore
be corrected for given sufficient knowledge about the camera. For spatio-temporal receptive
fields $\partial_{x^{\alpha} t^{\beta}} L$ that involve explicit temporal derivatives with $\beta > 0$, it will furthermore disappear
altogether, since the vignetting only depends upon the spatial coordinates.



7 Summary and conclusions 

We have described how the shapes of receptive field profiles in the early visual pathway can be 
constrained from structural symmetry properties of the environment, which include the requirement 
that the receptive field responses should be sufficiently well-behaved (covariant) under basic image 
transformations. We have also shown how these covariance properties of receptive fields enable true 
invariance properties of visual processes at the systems level, if combined with max-like operations 
over the output of receptive field families tuned to different filter parameters. 

The invariance and covariance properties that we have considered include (i) scaling transfor- 
mations to handle objects and substructures of different size as well as objects at different distances 
from the observer, (ii) affine transformations to capture image deformations caused by the perspective
mapping under variations of the viewing direction, (iii) Galilean transformations to handle
unknown relative motions between objects in the world and the observer and (iv) multiplicative
intensity transformations to provide robustness to slowly varying illumination variations as well as
invariance to intensity variations caused by multiplicative exposure control mechanisms.

These transformations should be interpreted as local approximations of the actual image
transformations, which in general can be assumed to be non-linear. Thus, a sufficient requirement for
these invariance or covariance properties to hold in practice, and thus enable robust visual
recognition from real-world image data, is that these approximations should hold locally within the
support region of a given receptive field. Therefore, these theoretical results can be extended to more
complex scenes by using different local approximations for receptive fields at different spatial or
spatio-temporal points.

The presented theory leads to a computational framework for defining spatial and spatio-temporal 
receptive fields from visual data with the attractive properties that: (i) the receptive field profiles 
can be derived by necessity from first principles and (ii) it leads to predictions about receptive 
field profiles in good agreement with receptive fields found by cell recordings in biological vision. 
Specifically, idealized models have been presented for space-time separable receptive fields in the 
retina and LGN and for non-separable simple cells in V1.

The modelling in this article has been carried out at a more abstract level of computation
than is used in many other computational models, and should therefore be applicable to a
large variety of neural models, provided that their functional properties can be described by
appropriate diffusion equations. Since the results are based on inherent properties of the image
formation process, they are very general. If one accepts the assumptions underlying the model,
the results hold for any computational model whose functionality is compatible with these
assumptions, and should therefore have important implications both for computational modelling
of visual processes based on receptive fields and for computational neuroscience.

Compared to more common approaches of learning receptive field profiles from natural image
statistics, the proposed framework makes it possible to derive the shapes of idealized receptive fields
without any need for training data. The proposed framework for invariance and covariance properties
also adds explanatory value by showing that the families of receptive field profiles tuned to different
orientations in space and image velocities in space-time that can be observed in biological vision
can be explained from the requirement that the receptive fields should be covariant under basic
image transformations to enable true invariance properties. If the underlying receptive fields were
not covariant, then there would be a systematic bias in the visual operations, corresponding to
the amount of mismatch between the backprojected receptive fields.

The theory could also be used as a framework for raising questions concerning invariance
properties of biological vision. As a complement to the fundamental covariance properties, we have
outlined possible mechanisms for how true invariance under scaling transformations, affine
transformations and Galilean transformations can be obtained already at the level of receptive field
responses. The presented mechanisms are based on two major types of principles: (i) detecting
extremum values of appropriately normalized receptive field responses over variations of the filter
parameters or (ii) normalizing the receptive field responses with respect to a preferred reference
frame that is constructed from criteria that are invariant under the corresponding image
transformations. These methods have been successfully applied in the area of computer vision and demonstrate
how the covariance properties of the proposed receptive field model can be used for defining truly
scale invariant, affine invariant and Galilean invariant visual operations already at the level of
receptive fields, which can then provide a basis for computational mechanisms for invariant recognition
of visual objects and events at the systems level.
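
Principle (i) can be illustrated for the scale parameter with a small numerical sketch. It assumes an idealized input, the two-dimensional blob $f(x) = \exp(-|x|^2 / (2 s_0))$, for which the Gaussian scale-space response is known in closed form, so no image filtering is needed. The function names and the geometric grid of scale levels are hypothetical choices for illustration.

```python
import math

def normalized_laplacian_at_center(s, s0):
    """Scale-normalized Laplacian response s * (L_x1x1 + L_x2x2) at the blob centre.

    Assumes the input is the 2-D blob f(x) = exp(-|x|^2 / (2 s0)), for which Gaussian
    smoothing with variance s gives L(x; s) = s0 / (s0 + s) * exp(-|x|^2 / (2 (s0 + s))).
    """
    return -2.0 * s * s0 / (s0 + s) ** 2

def select_scale(s0, scales):
    """Max-like selection: the scale level with the strongest normalized response."""
    return max(scales, key=lambda s: abs(normalized_laplacian_at_center(s, s0)))

# Geometric grid of scale levels (variances); a hypothetical sampling of the filter family.
scales = [2.0 ** (k / 4.0) for k in range(-16, 41)]

results = {}
for size_factor in (1.0, 2.0, 4.0):
    s0 = 4.0 * size_factor ** 2  # blob variance grows with the square of the size factor
    s_hat = select_scale(s0, scales)
    results[size_factor] = (s_hat, normalized_laplacian_at_center(s_hat, s0))
    print(size_factor, results[size_factor])
```

In this idealized setting the selected scale follows the size of the blob (a covariance property), while the normalized response at the selected scale stays the same for all sizes (an invariance property). With a finite grid of scale levels, this holds only for structures whose characteristic scale lies within the sampled range.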

We have also described how invariance to local multiplicative illumination transformations and 
exposure control mechanisms will be automatically obtained for receptive fields in terms of spatial 
or spatio-temporal derivatives. If we can assume that the illumination varies slowly and can be 
regarded as constant over the support region of the receptive field, the receptive field response will 
therefore have a direct physical interpretation as corresponding to variations in the surface structures 
of objects in the environment. Thus, the receptive field responses reflect important physical properties
of objects and events in the environment to support visual recognition.

[Figure 17 diagram, read from bottom to top: image data $f$; covariant receptive fields with kernels $T(\cdot;\, s)$; covariant receptive field responses $L(\cdot;\, s)$; selection mechanisms over scale, affine and Galilean parameters $s$; invariant image representations for invariant recognition.]

Figure 17: Schematic overview of how the covariance properties of the receptive fields in the proposed
receptive field model lead to covariant image measurements, from which truly invariant image representations
can then be obtained by complementary selection mechanisms that operate over the parameters $s$ of the
receptive fields corresponding to variations over scale, affine image deformations and Galilean motions. For
pure scaling transformations, the parameter $s$ of the receptive fields will be a scalar scale parameter, whereas a
covariance matrix $\Sigma$ is needed to capture more general affine image deformations. For spatio-temporal image
data, an additional temporal scale parameter $\tau$ and an additional image velocity parameter $v$ are furthermore
needed.

It should be emphasized, however, that the model has not been constructed to mimic mam- 
malian vision or the vision system in other species. Instead it is intended as an idealized theoretical 
and computational model to capture inherent properties of basic image transformations that any 
computational vision model needs to be confronted with. 

Concerning limitations of the proposed approach, it should be stressed that a basic requirement 
for obtaining true invariance with respect to the image transformations according to the proposed 
invariance mechanisms, is that the vision system has a sufficient number of receptive fields over a 
sufficient range of filter parameters to support invariance over a corresponding range of parameter 
variations. Notably, such a limitation is consistent with the findings from biological vision that the 
scale invariant properties of neurons may only hold over finite ranges of scale variations (Ito et al., 1995).



It should also be noted that the invariance and covariance properties are only guaranteed to 
hold if the same local approximation of the image transformation is valid within the entire support 
region of the receptive field. Thus, complementary mechanisms can be needed to handle, e.g., 
discontinuities in depth, discontinuities in the illumination field or specularities. 

An interesting observation that can be made from the similarities between the receptive field
families that have been derived by necessity from the assumptions and the receptive field profiles found
by cell recordings in biological vision, is that receptive fields in the retina, LGN and V1 of higher
mammals are very close to ideal in view of the stated structural requirements/symmetry properties.
In this sense, biological vision can be seen as having adapted very well to the transformation
properties of the outside world and to the transformations that occur when a three-dimensional world is
projected to a two-dimensional image domain and exposed to illumination variations.

Thus, image measurements in terms of receptive fields according to the proposed model can 
(i) be interpreted as corresponding to image features that are either invariant or covariant with 
respect to basic geometric transformations and illumination variations and can (ii) serve as a foun- 
dation for achieving invariant recognition of visual objects at the system level under variations in 
viewpoint, retinal size, object motion and illumination. 

Against the background of the presented theory, we can therefore interpret the receptive fields in
V1 as highly dedicated computational units that are very well adapted to enable the computation of
invariant image representations at higher levels in the visual hierarchy.



8 Discussion



In his recent overview of Bayesian approaches to understanding the brain, Friston (2011) writes that
". . . we are trying to infer the causes of our sensations based on a generative model of the world" and
". . . if the brain is making inferences about the causes of its sensations then it must have a model of
the causal relationships (connections) among (hidden) states of the world that cause sensory input. 
It follows that neuronal connections encode (model) causal connections that conspire to produce 
sensory information". He furthermore states that an underlying message in several lines of brain 
research is that the brain is regarded as "optimal in some sense". 

The presented theory can be seen as describing consequences of a similar way of reasoning 
regarding the development of receptive fields in the earliest stages of visual processing. If the 
brain is to handle the large natural variability in image data under basic image transformations, 
such as scaling variations, viewing variations, relative motion or illumination variations, then an 
optimal strategy may be to adapt to these variabilities by making it possible to respond to image 
transformations in terms of invariance or covariance properties. If the receptive fields were not
covariant under basic image transformations, then some of the variabilities
in the information could not be appropriately captured by the vision system, which would affect its
performance. By additionally developing invariance properties at higher levels in a visual hierarchy,
the brain will be able to deal with natural image transformations in a robust and efficient manner.

Thus, the proposed theory of receptive fields can be seen as describing basic physical
constraints under which a learning-based method for the development of receptive fields will operate
and the solutions to which an optimal adaptive system may converge. Field (1987) as well as
Doi and Lewicki (2005) have described how natural images are not random but instead exhibit
statistical regularities, and have used such statistical regularities for constraining the properties of
receptive fields. Receptive field profiles have been derived by statistical methods such as principal
component analysis (Olshausen and Field, 1996; Rao and Ballard, 1998), independent component
analysis (Simoncelli and Olshausen, 2001; Hyvärinen et al., 2009) and sparse coding principles
(Lőrincz et al., 2012). The theory presented in this paper can be seen as a theory at a higher level
of abstraction, in terms of basic principles that reflect properties of the environment that in turn
determine properties of the image data, without need for explicitly constructing specific statistical
models for the image statistics. Specifically, the proposed theory can be used for explaining why
the above mentioned statistical models lead to qualitatively similar types of receptive fields as the
idealized receptive fields obtained from our theory.

Concerning the closely related issue of how receptive fields are distributed over the visual
cortex, Kaschube et al. (2010) have found that pinwheel density, as defined from singularities in the
orientation fields of orientation hypercolumns, is similar between species that separated
evolutionarily more than 65 million years ago. By studying structural properties of self-organizing systems
for idealized neural interaction models, they showed that an overall suppressive nature of non-local
long-range interactions is essential for the development of the pinwheel layout observed in
carnivores and primates. Thus, the distribution of orientation hypercolumns in the visual cortex can be
predicted from internal structural properties of self-organizing neural networks. This paper presents
a corresponding theoretical study of how the shapes of receptive field profiles found in the retina,
LGN and the striate cortex can be predicted from structural properties of the environment and of
how invariance properties can be achieved with an additional assumption concerning the
architecture of complementary selection mechanisms that operate over ensembles of receptive fields.

In terms of computational modelling of vision, the proposed model for covariant receptive fields 
leading to true invariance properties should require a significantly lower amount of training data 
compared to approaches that involve explicit learning of receptive fields or compared to computa- 
tional models that are not based on explicit invariance properties in relation to the image measure- 
ments. Specifically, we propose that if the aim is to build a computational vision system that solves 
specific visual tasks, then a neuro-inspired artificial vision system based on these types of provable 
invariance properties should allow for more robust handling of natural imaging variations. 






Acknowledgements 



I would like to thank Benjamin Auffarth for valuable discussions and suggestions concerning the 
presentation. 

The support from the Swedish Research Council, Vetenskapsrådet (contract 2010-4766) and
from the Royal Swedish Academy of Sciences as well as the Knut and Alice Wallenberg Foundation 
is gratefully acknowledged. 

No conflict of interest. This research has been conducted in the absence of any commercial or 
financial relationships that could be construed as a potential conflict of interest. 

References 

A. Almansa and T. Lindeberg. Fingerprint enhancement by shape adaptation of scale-space operators with automatic 
scale-selection. IEEE Transactions on Image Processing, 9(12):2027-2042, 2000.

A. Baumberg. Reliable feature matching across widely separated views. In Proc. CVPR, pages I:774-781, Hilton
Head, SC, 2000. 

H. Bay, A. Ess, T. Tuytelaars, and L. van Gool. Speeded up robust features (SURF). Computer Vision and Image
Understanding, 110(3):346-359, 2008.

I. Biederman and E. E. Cooper. Size invariance in visual object priming. Journal of Experimental Psychology: Human
Perception and Performance, 18(1):121-133, 1992.

M. C. A. Booth and E. T. Rolls. View-invariant representations of familiar objects by neurons in the inferior temporal 
visual cortex. Cerebral Cortex, 8:510-523, 1998. 

L. Bretzner and T. Lindeberg. Feature tracking with automatic selection of spatial scales. Computer Vision and Image 
Understanding, 71(3):385-392, Sep. 1998. 

M. Carandini, J. B. Demb, V. Mante, D. J. Tolhurst, Y. Dan, B. A. Olshausen, J. L. Gallant, and N. C. Rust. Do we know 
what the early visual system does? Journal of Neuroscience, 25(46):10577-10597, 2005.

G. C. DeAngelis and A. Anzai. A modern view of the classical receptive field: Linear and non-linear spatio-temporal 
processing by V1 neurons. In L. M. Chalupa and J. S. Werner, editors, The Visual Neurosciences, volume 1, pages
704-719. MIT Press, 2004. 

G. C. DeAngelis, I. Ohzawa, and R. D. Freeman. Receptive field dynamics in the central visual pathways. Trends in 
Neuroscience, 18(10):451-457, 1995. 

J. J. DiCarlo and D. D. Cox. Untangling invariant object recognition. Trends in Cognitive Science, 11(8):333-341, 2007.

J. J. DiCarlo and J. H. R. Maunsell. Form representation in monkey inferotemporal cortex is virtually unaltered by free 
viewing. Nature Neuroscience, 3(8):814-821, 2000.

E. Doi and M. S. Lewicki. Relations between the statistical regularities of natural images and the response properties of 
the early visual system. In Japanese Cognitive Science Society: Sig P &P, pages 1-8, Kyoto University, 2005. 

S. Edelman and H. H. Bülthoff. Orientation dependence in the recognition of familiar and novel views of
three-dimensional objects. Vision Research, 32(12):2385-2400, 1992.

W. Einhäuser and P. König. Getting real - sensory processing of natural stimuli. Current Opinion in Neurobiology,
20(3):389-395, 2010.

A. Einstein. Relativity: the special and the general theory. New York: Henry Holt. Reprinted by Bartleby.com, 2000, 
1920. Translated by Robert W. Lawson. 

O. Faugeras, J. Touboul, and B. Cessac. A constructive mean-field analysis of multi-population neural networks
with random synaptic weights and stochastic inputs. Frontiers in Computational Neuroscience, 3(1):
10.3389/neuro.10.001.2009, 2009.

D. J. Field. Relations between the statistics of natural images and the response properties of cortical cells. J. of the
Optical Society of America, 4:2379-2394, 1987. 






L. M. J. Florack. Image Structure. Series in Mathematical Imaging and Vision. Springer, 1997. 

K. Friston. The history of the future of the Bayesian brain. NeuroImage, 2011. doi: 10.1016/j.neuroimage.2011.10.004.

C. S. Furmanski and S. A. Engel. Perceptual learning in object recognition: Object specificity and size invariance. Vision 
Research, 40:473-484, 2000. 

J. Gårding and T. Lindeberg. Direct computation of shape cues using scale-adapted spatial derivative operators. Int. J. of
Computer Vision, 17(2):163-191, 1996.

T. J. Gawne and J. M. Martin. Responses of primate visual cortical V4 neurons to simultaneously presented stimuli. 
Journal of Neurophysiology, 88(1):1128-1135, 1993.

W. S. Geisler. Visual perception and the statistical properties of natural scenes. Annual Review of Psychology, 59: 
10.1-10.26, 2008. 

R. L. T. Goris and H. P. Op de Beeck. Neural representations that support invariant object recognition. Frontiers in 
Computational Neuroscience, 3(Article 3):1-16, 2009.

D. B. Grimes and R. P. N. Rao. Bilinear sparse coding for invariant vision. Nature Neuroscience, 3(8):814-821, 2005. 

E. Hille and R. S. Phillips. Functional Analysis and Semi-Groups, volume XXXI. American Mathematical Society 
Colloquium Publications, 1957. 

D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat's striate cortex. J Physiol, 147:226-238, 
1959. 

D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. 
J Physiol, 160:106-154, 1962. 

D. H. Hubel and T. N. Wiesel. Brain and Visual Perception: The Story of a 25-Year Collaboration. Oxford University
Press, 2005. 

C. P. Hung, G. Kreiman, T. Poggio, and J. J. DiCarlo. Fast readout of object identity from macaque inferior temporal
cortex. Science, 310:863-866, 2005. 

J. B. Hurley. Shedding light on adaptation. Journal of General Physiology, 119:125-128, 2002.

A. Hyvärinen, J. Hurri, and P. O. Hoyer. Natural Image Statistics: A Probabilistic Approach to Early Computational
Vision. Computational Imaging and Vision. Springer, 2009. 

M. Ito, H. Tamura, I. Fujita, and K. Tanaka. Size and position invariance of neuronal responses in monkey inferotemporal 
cortex. Journal of Neurophysiology, 73(1):218-226, 1995.

J. Jones and L. Palmer. The two-dimensional spatial structure of simple receptive fields in cat striate cortex. J. of
Neurophysiology, 58:1187-1211, 1987a.

J. Jones and L. Palmer. An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate
cortex. J. of Neurophysiology, 58:1233-1258, 1987b.

M. Kaschube, M. Schnabel, S. Löwel, D. M. Coppola, L. E. White, and F. Wolf. Universality in the evolution of
orientation columns in the visual cortex. Science, 330:1113-1116, 2010.

J. J. Koenderink. The structure of images. Biological Cybernetics, 50:363-370, 1984. 

J. J. Koenderink and A. J. van Doorn. Representation of local geometry in the visual system. Biological Cybernetics, 55:
367-375, 1987. 

J. J. Koenderink and A. J. van Doorn. Generic neighborhood operators. IEEE Trans. Pattern Analysis and Machine
Intell, 14(6):597-605, Jun. 1992. 

L. Lagae, S. Raiguel, and G. A. Orban. Speed and direction selectivity of macaque middle temporal neurons. Journal of 
Neurophysiology, 69(1):19-39, 1993.

I. Laptev and T. Lindeberg. Space-time interest points. In Proc. 9th Int. Conf. on Computer Vision, pages 432-439, Nice, 
France, Oct. 2003. 

I. Laptev and T. Lindeberg. Velocity-adapted spatio-temporal receptive fields for direct recognition of activities. Image
and Vision Computing, 22(2):105-116, 2004a.






I. Laptev and T. Lindeberg. Local descriptors for spatio-temporal recognition. In Proc. ECCV'04 Workshop on Spatial 
Coherence for Visual Motion Analysis, volume 3667 of Lecture Notes in Computer Science, pages 91-103, Prague, 
Czech Republic, May. 2004b. Springer. 

I. Laptev, B. Caputo, C. Schuldt, and T. Lindeberg. Local velocity-adapted motion events for spatio-temporal recognition.
Computer Vision and Image Understanding, 108:207-229, 2007. 

S. Lazebnik, C. Schmid, and J. Ponce. A sparse texture representation using local affine regions. IEEE Trans. Pattern 
Analysis and Machine Intell, 27(8):1265-1278, 2005.

O. Linde and T. Lindeberg. Object recognition using composed receptive field histograms of higher dimensionality. In 
International Conference on Pattern Recognition, volume 2, pages 1-6, Cambridge, Aug. 2004. 

O. Linde and T. Lindeberg. Composed complex-cue histograms: An investigation of the information content in receptive 
field based image descriptors for object recognition. Computer Vision and Image Understanding, 116:538-560, 2012.

T. Lindeberg. Scale-Space Theory in Computer Vision. The Kluwer International Series in Engineering and Computer 
Science. Springer, 1994a. 

T. Lindeberg. Scale-space theory: A basic tool for analysing structures at different scales. Journal of Applied Statistics, 
21(2):225-270, 1994b. Also available from http://www.csc.kth.se/~tony/abstracts/Lin94-SI-abstract.html.

T. Lindeberg. Linear spatio-temporal scale-space. In B. M. ter Haar Romeny, L. M. J. Florack, J. J. Koenderink, and 
M. A. Viergever, editors, Scale-Space Theory in Computer Vision: Proc. First Int. Conf. Scale-Space'97, volume
1252 of Lecture Notes in Computer Science, pages 113-127, Utrecht, The Netherlands, Jul. 1997. Springer. Extended
version available as technical report ISRN KTH NA/P-01/22-SE from KTH. 

T. Lindeberg. Feature detection with automatic scale selection. Int. J. of Computer Vision, 30(2):77-116, 1998. 

T. Lindeberg. Principles for automatic scale selection. In Handbook on Computer Vision and Applications, pages 239- 
274. Academic Press, Boston, USA, 1999. Also available from http://www.csc.kth.se/cvap/abstracts/cvap222.html. 

T. Lindeberg. Scale-space. In Benjamin Wah, editor, Encyclopedia of Computer Science and Engineering, pages
2495-2504. John Wiley and Sons, Hoboken, New Jersey, 2008. dx.doi.org/10.1002/9780470050118.ecse609 Also available
from http://www.nada.kth.se/~tony/abstracts/Lin08-EncCompSci.html.

T. Lindeberg. Generalized Gaussian scale-space axiomatics comprising linear scale-space, affine scale-space and
spatio-temporal scale-space. J. of Mathematical Imaging and Vision, 40(1):36-81, 2011.

T. Lindeberg. A computational model of visual receptive fields. 2012. submitted to Biological Cybernetics. 

T. Lindeberg and L. Bretzner. Real-time scale selection in hybrid multi-scale representations. In L. Griffin and
M. Lillholm, editors, Proc. Scale-Space Methods in Computer Vision: Scale-Space'03, volume 2695 of Lecture Notes in
Computer Science, pages 148-163, Isle of Skye, Scotland, 2003. Springer.

T. Lindeberg and J. Gårding. Shape-adapted smoothing in estimation of 3-D depth cues from affine distortions of local
2-D structure. Image and Vision Computing, 15:415-434, 1997. 

T. Lindeberg, A. Akbarzadeh, and I. Laptev. Galilean-corrected spatio-temporal interest operators. In International 
Conference on Pattern Recognition, pages 1:57-62, Cambridge, 2004. 

N. K. Logothetis, J. Pauls, and T. Poggio. Shape representation in the inferior temporal cortex of monkeys. Current 
Biology, 5(2):552-563, 1995. 

A. Lorincz, Z. Palotai, and G. Szirtes. Efficient sparse coding in early sensory processing: Lessons from signal recovery. 
PLoS Computational Biology, 8(3):e1002372, 2012. doi:10.1371/journal.pcbi.1002372. 

D. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. of Computer Vision, 60(2):91-110, 2004. 

B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Image 
Understanding Workshop, 1981. 

S. Marcelja. Mathematical description of the responses of simple cortical cells. J. of the Optical Society of America, 
70(11):1297-1300, 1980. 

M. Mattia and P. Del Giudice. Population dynamics of interacting spiking neurons. Physical Review E, 66(5):051917, 
2002. 



K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. Int. J. of Computer Vision, 
60(1):63-86, 2004. 

A. Negre, C. Braillon, J. L. Crowley, and C. Laugier. Real-time time-to-collision from variation of intrinsic scale. 
Experimental Robotics, 39:75-84, 2008. 

B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural 
images. Nature, 381:607-609, 1996. 

B. A. Olshausen, C. H. Anderson, and D. C. van Essen. A neurobiological model of visual attention and invariant pattern 
recognition based on dynamic routing of information. Journal of Neuroscience, 13(11):4700-4719, 1993. 

A. Omurtag, B. W. Knight, and L. Sirovich. On the simulation of large populations of neurons. Journal of Computational 
Neuroscience, 8:51-63, 2000. 

S. E. Palmer. Vision Science: Photons to Phenomenology. MIT Press, 1999. First Edition. 

S. E. Petersen, J. F. Baker, and J. M. Allman. Direction-specific adaptation in area MT of the owl monkey. Brain 
Research, 346:146-150, 1985. 

R. Q. Quiroga, L. Reddy, G. Kreiman, C. Koch, and I. Fried. Invariant visual representations by single neurons in the 
human brain. Nature, 435:1102-1107, 2005. 

R. P. N. Rao and D. H. Ballard. Development of localized oriented receptive fields by learning a translation-invariant 
code for natural images. Computation in Neural Systems, 9(2):219-234, 1998. 

M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019-1025, 1999. 

R. W. Rodieck. Quantitative analysis of cat retinal ganglion cell response to visual stimuli. Vision Research, 
5(11):583-601, 1965. 

H. R. Rodman and T. D. Albright. Single-unit analysis of pattern-motion selective properties in the middle temporal 
visual area (MT). Experimental Brain Research, 75:53-64, 1989. 

E. T. Rolls. Brain mechanisms for invariant visual recognition and learning. Behavioural Processes, 33:113-138, 1994. 

F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce. 3D object modeling and recognition using local affine-invariant 
image descriptors and multi-view spatial constraints. Int. J. of Computer Vision, 66(3):231-259, 2006. 

B. Schiele and J. Crowley. Recognition without correspondence using multidimensional receptive field histograms. 
Int. J. of Computer Vision, 36(1):31-50, 2000. 

E. P. Simoncelli and B. A. Olshausen. Natural image statistics and neural representations. Annual Review of 
Neuroscience, 24:1193-1216, 2001. 

J. B. J. Smeets and E. Brenner. The difference between the perception of absolute and relative motion: A reaction time 
study. Vision Research, 34(2):191-195, 1994. 

D. C. Somers, S. B. Nelson, and M. Sur. An emergent model of orientation selectivity in cat visual cortical simple cells. 
Journal of Neuroscience, 15(8):5448-5465, 1995. 

H. Sompolinsky and R. Shapley. New perspectives on the mechanisms for orientation selectivity. Current Opinion in 
Neurobiology, 7:514-522, 1997. 

D. G. Stork and H. R. Wilson. Do Gabor functions provide appropriate descriptions of visual cortical receptive fields? 
J. of the Optical Society of America, 7(8):1362-1373, 1990. 

B. ter Haar Romeny. Front-End Vision and Multi-Scale Image Analysis. Springer, 2003. 

T. Tuytelaars and L. van Gool. Matching widely separated views based on affine invariant regions. Int. J. of Computer 
Vision, 59(1):61-85, 2004. 

A. van der Schaaf and J. H. van Hateren. Modelling the power spectra of natural images: Statistics and information. 
Vision Research, 36(17):2759-2770, 1996. 

J. Weickert. Anisotropic Diffusion in Image Processing. Teubner-Verlag, Stuttgart, Germany, 1998. 

G. Willems, T. Tuytelaars, and L. van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. 
In Proc. ECCV'08, volume 5303 of Lecture Notes in Computer Science, pages 650-663, Marseille, France, 2008. 
Springer. 



L. Wiskott. How does our visual system achieve shift and size invariance? In J. L. van Hemmen and T. J. Sejnowski, 
editors, Problems in Systems Neuroscience. Oxford University Press, 2004. 

A. P. Witkin. Scale-space filtering. In Proc. 8th Int. Joint Conf. Art. Intell., pages 1019-1022, Karlsruhe, Germany, Aug. 
1983. 

R. A. Young. The Gaussian derivative model for spatial vision: I. Retinal mechanisms. Spatial Vision, 2:273-293, 1987. 

R. A. Young and R. M. Lesperance. The Gaussian derivative model for spatio-temporal vision: II. Cortical data. Spatial 
Vision, 14(3, 4):321-389, 2001. 

R. A. Young, R. M. Lesperance, and W. W. Meyer. The Gaussian derivative model for spatio-temporal vision: I. Cortical 
model. Spatial Vision, 14(3, 4):261-319, 2001. 


